The Architecture of Genome Wide Association Studies

Abstract: Over 3,000 genome-wide association studies (GWAS) across a decade of research have discovered genetic risk variants and biological pathways that play a role in specific diseases, disorders and behavioural traits. The NHGRI-EBI GWAS Catalog indexes all published studies; to this we construct, harmonise and merge various data sources and conduct multiple detailed analyses in order to provide the first systematic, data-driven examination of GWAS to date. We pursue multiple distinct avenues of interest: participants, funders, researchers, traits, cohorts and consortia. We first replicate and substantially expand existing analysis of participant ancestry over time, adding further evidence of a bias towards recruitment from European and American countries. We then provide an analysis of who funds GWAS (and what they fund): over 85% of acknowledgements relate to NIH grants, with almost the entirety of the remainder relating to UK agencies. A network analysis of GWAS authors shows a dense group of Principal Investigators contributing large datasets, reflected in centrality measures and an original ‘GWAS H-Index’. Approximately 40% of authors are female, although men dominate the typical ‘senior author’ position (71%). We also find a gendered division across the studied traits, where the dominant categories relate to neurological disorders, neoplasms and drug responses. An extensive manual data collection exercise summarises the most frequently utilised cohorts, for which we collate a range of summary descriptives, and we also provide an overview of the main consortia in play. In light of these findings, we discuss challenges and solutions for the future of genomics.

1. Preamble

This is an IPython Notebook to accompany the paper entitled 'The Architecture of Genome Wide Association Studies' by Melinda Mills and Charles Rahal. It allows us to share the data analysis undertaken in the construction of this data-driven review of all GWAS to date, and it can also be used to dynamically replicate the analysis herein going forward. For installation guidelines, please see the readme.md that accompanies this repository/clone/zip. A preliminary point to note: if you wish to run this .ipynb file itself, you may need to override the default settings of some versions of Jupyter Notebook (approximately 4.2 to 5.1) by opening it with:

jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000

or by editing your jupyter_notebook_config.py file. Let's begin by loading in our favourite Python modules, IPython magics and a bunch of custom functions we've written specifically for this project:

In [1]:
import networkx as nx
import os
import pandas as pd
import re
import requests
import seaborn as sns
import sys
import numpy as np
import csv
import itertools
import warnings

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib as mpl
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap
from matplotlib.colors import Normalize
from matplotlib import rc

import gender_guesser.detector as gender
from Bio import Entrez
from PIL import Image
import requests_ftp

from wordcloud import WordCloud, STOPWORDS
from IPython.display import HTML, display

from Support.LoadData import (
    EFO_parent_mapper, load_gwas_cat, load_pubmed_data, 
    make_timely, make_clean_CoR
)
from Support.PubMed import (
    build_collective, build_author, build_funder,
    build_abstract, build_citation
)
from Support.Ancestry import ancestry_cleaner
from Support.Additional import (grey_color_func, clean_names)

warnings.filterwarnings('ignore')
mpl.rcParams.update(mpl.rcParamsDefault)
%config InlineBackend.figure_format = 'png'
%matplotlib inline
%load_ext autoreload
%autoreload 2

Then, let's dynamically grab the three main curated datasets that we will need from the GWAS Catalogue on the EBI website ('All Associations', 'All Studies', 'All Ancestries'), along with the EFO-to-parent trait mapping file from their FTP server:

In [2]:
#r = requests.get(
#    'https://www.ebi.ac.uk/gwas/api/search/downloads/studies_alternative')
#with open(os.path.abspath(
#          os.path.join('__file__', '../..', 'Data', 'Catalogue', 'Raw',
#                       'Cat_Stud.tsv')), 'wb') as tsvfile:
#        tsvfile.write(r.content)
#r = requests.get(
#    'https://www.ebi.ac.uk/gwas/api/search/downloads/ancestry')
#with open(os.path.abspath(
#          os.path.join('__file__', '../..', 'Data', 'Catalogue', 'Raw',
#                       'Cat_Anc.tsv')), 'wb') as tsvfile:
#    tsvfile.write(r.content)
#r = requests.get('https://www.ebi.ac.uk/gwas/api/search/downloads/full')
#with open(os.path.abspath(
#          os.path.join('__file__', '../..', 'Data', 'Catalogue', 'Raw',
#                       'Cat_Full.tsv')), 'wb') as tsvfile:
#    tsvfile.write(r.content)
#requests_ftp.monkeypatch_session()
#s = requests.Session()
#r = s.get('ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/latest/gwas-efo-trait-mappings.tsv')
#with open(os.path.abspath(os.path.join('__file__', '../..', 'Data', 'Catalogue',
#                                       'Raw', 'Cat_Map.tsv')), 'wb') as tsvfile:
#    tsvfile.write(r.content)

Let's link the PUBMEDID variable to the PubMed API and derive a series of datasets from it using the support functions written in PubMed.py. Note: 'collective' corresponds mostly to consortia and study groups.

In [3]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
id_list = set(Cat_Studies['PUBMEDID'].astype(str).tolist())
#Entrez.email = 'your.email@goeshere.com'
#papers = Entrez.read(Entrez.efetch(db='pubmed', retmode='xml', id=','.join(id_list)))
#build_collective(papers)
#build_author(papers)
#build_funder(papers)
#build_abstract(papers)
#build_citation(id_list, Entrez.email)
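The efetch call above joins every PUBMEDID into a single request; NCBI can reject very long ID lists, so one defensive pattern is to batch the IDs before fetching. A minimal sketch (`chunk_ids` is a hypothetical helper, not part of Support/PubMed.py):

```python
def chunk_ids(id_list, size=200):
    """Split an iterable of PubMed IDs into comma-joined batches,
    each suitable for a single Entrez.efetch request."""
    ids = sorted(str(i) for i in id_list)  # sort for reproducibility
    return [','.join(ids[i:i + size]) for i in range(0, len(ids), size)]

# Each batch could then be fetched and the results concatenated, e.g.:
# for batch in chunk_ids(id_list):
#     papers = Entrez.read(Entrez.efetch(db='pubmed', retmode='xml', id=batch))
```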

2. Introduction

2.1 Some Simple 'Facts'

Let's do some basic descriptive analysis to see what we've got, creating scalars that go inline in the text:

In [4]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()

print('There are currently: ' + str(len(Cat_Studies['PUBMEDID'].unique())) +
      ' GWAS papers published (in terms of unique PUBMEDIDs)')
print('The first observed GWAS in the catalogue was dated: ' +
      str(Cat_Studies['DATE'].min()))
print('However, only: ' + str(len(Cat_Studies[Cat_Studies['DATE'].str.contains(
    '2005|2006')])) + ' papers were published in 2005 and 2006 combined')
print('There are currently: ' +
      str(len(Cat_Studies['STUDY ACCESSION'].unique())) +
      ' unique Study Accessions')
print('There are currently: ' +
      str(len(Cat_Studies['DISEASE/TRAIT'].unique())) +
      ' unique Diseases/Traits studied')
print('These correspond to: ' +
      str(len(Cat_Studies['MAPPED_TRAIT'].unique())) +
      ' unique EBI "Mapped Traits"')
print('The total number of Associations found is currently: ' +
      str(round(Cat_Studies['ASSOCIATION COUNT'].sum(), 1)))
print('The average number of Associations is currently: ' +
      str(round(Cat_Studies['ASSOCIATION COUNT'].mean(), 1)))
print('Mean P-Value for the strongest SNP risk allele is currently: ' +
      str(round(Cat_Full['P-VALUE'].astype(float).mean(), 10)))
print('The number of associations reaching the 5e-8 threshold is ' +
      str(len(Cat_Full[Cat_Full['P-VALUE'].astype(float) < 5.000000e-8])))
print('The journal to feature the most GWAS studies since ' +
      str(Cat_Studies['DATE'].min()) +
      ' is: ' + Cat_Studies['JOURNAL'].mode()[0])
print('However, in 2017, ' + Cat_Studies[Cat_Studies['DATE'].str.contains(
    '2017')]['JOURNAL'].mode()[0] +
    ' published the largest number of GWAS papers')
print('Largest Accession to date: ' +
      str(Cat_Ancestry_groupedbyN.sort_values(by='N',
                                              ascending=False)['N'].iloc[0]) +
      ' people and was published in ' +
      str(Cat_Studies[
          Cat_Studies['STUDY ACCESSION'] ==
          Cat_Ancestry_groupedbyN.sort_values(by='N',
                                              ascending=False)['STUDY ACCESSION'].iloc[0]]['JOURNAL'].iloc[0]) +
      '. The first author was: ' +
      str(Cat_Studies[
          Cat_Studies['STUDY ACCESSION'] ==
          Cat_Ancestry_groupedbyN.sort_values(by='N',
                                              ascending=False)['STUDY ACCESSION'].iloc[0]]
          ['FIRST AUTHOR'].iloc[0]) + '.')
print('Total number of SNP-trait associations is ' +
      str(Cat_Studies['ASSOCIATION COUNT'].sum()) + '.')
print('Total number of journals publishing GWAS is ' +
      str(len(Cat_Studies['JOURNAL'].unique())) + '.')
print('The study with the largest number of authors has: ' +
      str(AuthorMaster.groupby(['PUBMEDID'])['AUTHORNAME'].count().max()) +
      ' authors.')
There are currently: 3314 GWAS papers published (in terms of unique PUBMEDIDs)
The first observed GWAS in the catalogue was dated: 2005-03-10
However, only: 10 papers were published in 2005 and 2006 combined
There are currently: 4943 unique Study Accessions
There are currently: 2909 unique Diseases/Traits studied
These correspond to: 2191 unique EBI "Mapped Traits"
The total number of Associations found is currently: 67230
The average number of Associations is currently: 13.6
Mean P-Value for the strongest SNP risk allele is currently: 1.5748e-06
The number of associations reaching the 5e-8 threshold is 33026
The journal to feature the most GWAS studies since 2005-03-10 is: Nat Genet
However, in 2017, Nat Commun published the largest number of GWAS papers
Largest Accession to date: 616919.0 people and was published in J Clin Endocrinol Metab. The first author was: Hu Y.
Total number of SNP-trait associations is 67230.
Total number of journals publishing GWAS is 436.
The study with the largest number of authors has: 559 authors.

2.2 Can the Abstracts Give us any Clues?

Let's make a nice infographic and do some basic NLP on the abstracts from all PubMed IDs. This figure is the header of the readme.md:

In [5]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
words_array = []
with open(os.path.abspath(
          os.path.join('__file__', '../..', 'Data', 'PUBMED',
                       'Pubmed_AbstractCount.csv')),
          'r', errors='replace') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['word'].lower() not in STOPWORDS:
            if len(row['word']) > 3:
                words_array.append(
                    (row['word'].upper(), float(row['num_words'])))
mask = Image.new('RGBA', (8000, 4467))
icon = Image.open(os.path.abspath(
                  os.path.join('__file__', '../..', 'Data', 'Support',
                               'doublehelix_mask.png'))).convert('RGBA')
mask.paste(icon, icon)
mask = np.array(mask)
wc = WordCloud(background_color='white', max_words=1250, mask=mask,
               max_font_size=5000)
wc.generate_from_frequencies(dict(words_array))
wc.recolor(color_func=grey_color_func)
wc.to_file(os.path.abspath(
           os.path.join('__file__', '../../', 'Figures', 'pdf',
                        'helix_wordcloud_1250_5000_black.pdf')))
plt.figure(figsize=(16, 8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

The file Pubmed_AbstractCount.csv in the PUBMED subdirectory of Data details a breakdown of the abstract word counts. Check the counts for 'European', 'Asian', etc.
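The counts in that file can be reproduced from raw abstract text with a simple frequency pass; a minimal sketch using collections.Counter, mirroring the stopword and length filters applied in the cell above (the STOPWORDS set here is a small stand-in for wordcloud.STOPWORDS):

```python
import re
from collections import Counter

STOPWORDS = {'the', 'and', 'with', 'were', 'from'}  # stand-in for wordcloud.STOPWORDS

def abstract_word_counts(abstracts, min_len=4):
    """Count word frequencies across abstracts, dropping stopwords and
    tokens shorter than min_len, as in the word cloud filtering above."""
    counts = Counter()
    for text in abstracts:
        for word in re.findall(r'[A-Za-z]+', text.lower()):
            if word not in STOPWORDS and len(word) >= min_len:
                counts[word] += 1
    return counts
```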

2.3 The Growth in Depth and Scope of GWAS over Time

We can also visualise the ever-increasing sample sizes and popularity of GWAS studies. The top panel shows the increasing number of GWAS conducted over time. This is partly due to the increasingly large sample sizes: the fraction of 'Big N' studies increases. The bottom-left subplot shows that with this growth comes a bigger N and a high correlation with the number of significant associations found. The bottom-right subplot shows that the unique number of journals publishing GWASs per year, as well as the unique number of diseases/traits studied, also steadily increases as the appeal of GWASs broadens.
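The four stacked categories in the top panel correspond to sample-size bins; a minimal sketch of the binning logic (the bin edges come from the variable labels used below, while `bin_n` itself is a hypothetical helper rather than part of Support/LoadData.py):

```python
def bin_n(n):
    """Assign a study's total N to one of the four size bands
    used in the stacked bar chart."""
    if n <= 5000:
        return 'N ≤ 5,000'
    elif n <= 50000:
        return '5,001 ≤ N ≤ 50,000'
    elif n <= 100000:
        return '50,001 ≤ N ≤ 100,000'
    return '100,001 ≤ N'
```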

In [6]:
plt.style.use('seaborn-ticks')
plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.linewidth'] = 0.75
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
yearlist = []
yearquarterlist = []
for year in range(2007, 2018):
    yearlist.append(str(year))
    for quarter in ['Q1', 'Q2', 'Q3', 'Q4']:
        yearquarterlist.append(str(year) + quarter)
variables = ['N ≤ 5,000', '5,001 ≤ N ≤ 50,000', '50,001 ≤ N ≤ 100,000',
             '100,001 ≤ N', 'N', 'Associations', 'Journals Printing GWAS',
             '# Diseases Studied']
df_years, df_quarters = make_timely(variables, yearlist, yearquarterlist,
                                    Cat_Studies, Cat_Ancestry,
                                    Cat_Ancestry_groupedbyN)
plt.figure(figsize=(15, 10))
axA = plt.subplot(2, 1, 1)
ax0variables = ['N ≤ 5,000', '5,001 ≤ N ≤ 50,000',
                '50,001 ≤ N ≤ 100,000', '100,001 ≤ N']
ax0 = df_quarters[ax0variables].plot(kind='bar', stacked=True, ax=axA,
                                     color=['orangered', 'steelblue',
                                            'goldenrod', 'forestgreen'],
                                     alpha=0.75, edgecolor='k')
ax0.set_ylabel('Number of Study Accessions', fontsize=12)
ax0.tick_params(labelsize=10)
ax0.legend(fontsize=12, loc='upper left')
ax0.grid(b=True, which='major', color='#d3d3d3', linestyle='--', alpha=0.75)
axB = plt.subplot(2, 2, 3)
ax1a = df_years[['Associations']].plot(ax=axB, color='orangered', alpha=0.75,
                                       rot=90, marker='o', linewidth=1.5,
                                       markersize=8,
                                       label='Associations Discovered',
                                       markeredgecolor='k',
                                       markeredgewidth=0.5)
ax1b = axB.twinx()
ax1b.plot(df_years[['N']], color='steelblue', marker='s', markersize=7,
          linewidth=1.5, markeredgecolor='k', markeredgewidth=0.5)
ax1a.set_ylabel('Number of Associations Discovered', fontsize=12)
ax1b.set_ylabel('Number of Study Participants Analyzed', fontsize=12)
ax1b.grid(False)
axB.plot(0, 0, '-r', color='steelblue', marker='s', markersize=7,
         markeredgecolor='k', markeredgewidth=0.5)
axB.legend(['Associations (left)', 'Participants (right)'],
           fontsize=12, loc='upper left')
ax1a.tick_params(labelsize=10)
ax1b.tick_params(labelsize=10)
axB.grid(b=True, which='major', color='#d3d3d3', linestyle='--', alpha=0.75)
plt.axis('tight')

axC = plt.subplot(2, 2, 4)
axtest = axC.twinx()
ax_2a = df_years[['Journals Printing GWAS']].plot(kind='bar',
                                                  ax=axC, position=1,
                                                  color='orangered',
                                                  legend=False, width=0.35,
                                                  alpha=0.75, edgecolor='k')
ax_2b = df_years[['# Diseases Studied']].plot(kind='bar', ax=axtest,
                                              position=0, color='steelblue',
                                              width=0.35, legend=False,
                                              alpha=0.75, edgecolor='k')
ax_2a.set_ylabel('Unique Number of Journals Publishing GWAS', fontsize=12)
ax_2b.set_ylabel('Unique Number of Diseases Studied', fontsize=12)
ax_2b.grid(False)
axC.plot(np.nan, 'orangered', linewidth=4)
axC.plot(np.nan, 'steelblue', linewidth=4)
axC.legend(['Journals (left)', 'Diseases (right)'],
           fontsize=12, loc='upper left')
ax_2a.margins(1, 0.5)
ax_2a.tick_params(labelsize=10)
ax_2b.tick_params(labelsize=10)
axC.grid(b=True, which='major', color='#d3d3d3', linestyle='--', alpha=0.75)

plt.axis('tight')
plt.tick_params(axis='both', which='minor', labelsize=10)
plt.tight_layout()

plt.savefig(os.path.abspath(os.path.join('__file__', '../..',
                                         'Figures', 'svg',
                                         'GWAS_Popularity.svg')),
            bbox_inches='tight')

3. Participants

3.1 Ancestry

This section of the analysis only uses data contained within the GWAS Catalogue, but goes slightly deeper than Section 2 above and related papers in the literature. We take the 'Broad Ancestral Category' field and use it to aggregate 135 combinations of 17 ancestries into 7 'broader ancestral categories'. To get a more detailed analysis, we load in some of our supporting functions to clean up the "INITIAL SAMPLE SIZE" and "REPLICATION SAMPLE SIZE" free-text fields in the catalogue and then plot something analogous to Popejoy and Fullerton. The supporting functions are based on regular expressions and data-wrangling techniques which exploit patterns in these free-text fields. By way of a very simple example: “19,546 British ancestry individuals from 6863 families.” gets cleaned to two separate fields, “19,546” and “British”, which can then be used for further downstream analysis. A slightly more complex example: “2,798 European ancestry individuals, 228 French Canadian founder population individuals” will correspond to two entries of 2798 and 228 in the new ‘Cleaned N’ type variable, corresponding to ‘European’ and ‘French Canadian’ in the ‘Cleaned Ancestry’ type variable respectively. Note that the figure only focuses on specific countries or ancestries which are explicitly mentioned, and ignores the very broad mentions of continental or aggregated ancestries such as "European", "African American", "Asian", etc. However, these can still be found in the lists below, decomposed into both Initial and Replication samples (based on the 'INITIAL SAMPLE SIZE' and 'REPLICATION SAMPLE SIZE' free-text fields accordingly). We first tabulate and list the detailed ancestries in first the initial and then the replication samples, then aggregate the two into a table, and then visualise it:
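The cleaning step described above can be sketched as a small regular-expression pass; this is a simplified stand-in for the project's actual ancestry_cleaner in Support/Ancestry.py, which handles many more edge cases:

```python
import re

def simple_ancestry_split(free_text):
    """Split a free-text sample description into (count, ancestry) pairs,
    e.g. '19,546 British ancestry individuals from 6863 families.'
    -> [(19546, 'British')]. A simplified stand-in for the repository's
    ancestry_cleaner, not the real implementation."""
    pattern = re.compile(
        r'([\d,]+)\s+([A-Z][A-Za-z ]*?)\s*(?:ancestry|founder|individuals)')
    pairs = []
    for m in pattern.finditer(free_text):
        n = int(m.group(1).replace(',', ''))  # strip thousands separators
        pairs.append((n, m.group(2).strip()))
    return pairs
```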

In [7]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Cat_Ancestry['Broader Ancestral'] = Cat_Ancestry['Broader Ancestral'].str.replace("/","\\")    
Cat_Studies['InitialClean'] = Cat_Studies.apply(
    lambda row: ancestry_cleaner(row, 'INITIAL SAMPLE SIZE'), axis=1)
Cat_Studies['ReplicationClean'] = Cat_Studies.apply(
    lambda row: ancestry_cleaner(row, 'REPLICATION SAMPLE SIZE'), axis=1)
fileout = open(os.path.abspath(
               os.path.join('__file__', '../..', 'Data',
                            'Catalogue', 'Synthetic',
                            'new_initial_sample.csv')),
               'w', encoding='utf-8')
fileout.write('STUDY ACCESSION,Cleaned Ancestry,Cleaned Ancestry Size\n')
for index, row in Cat_Studies.iterrows():
    checksum = 0
    for ancestry in row['InitialClean'].split(';'):
        number = re.findall(r'\d+', ancestry.strip())
        if (len(number) == 1):
            checksum += 1
    if checksum == len(row['InitialClean'].split(';')):
        for ancestry in row['InitialClean'].split(';'):
            number = re.findall(r'\d+', ancestry.strip())
            words = ''.join(i for i in ancestry.strip() if not i.isdigit())
            if (len(number) == 1) and (len(words.strip()) > 3) and \
               (sum(1 for c in words if c.isupper()) > 0):
                fileout.write(row['STUDY ACCESSION'] + ',' +
                              words.strip() + ',' + str(number[0]) + '\n')
fileout.close()
fileout = open(os.path.abspath(os.path.join('__file__', '../..', 'Data',
                                            'Catalogue', 'Synthetic',
                                            'new_replication_sample.csv')),
               'w', encoding='utf-8')
fileout.write('STUDY ACCESSION,Cleaned Ancestry,Cleaned Ancestry Size\n')
for index, row in Cat_Studies.iterrows():
    checksum = 0
    for ancestry in row['ReplicationClean'].split(';'):
        number = re.findall(r'\d+', ancestry.strip())
        if (len(number) == 1):
            checksum += 1
    if checksum == len(row['ReplicationClean'].split(';')):
        for ancestry in row['ReplicationClean'].split(';'):
            number = re.findall(r'\d+', ancestry.strip())
            words = ''.join(i for i in ancestry.strip() if not i.isdigit())
            if (len(number) == 1) and (len(words.strip()) > 3) and \
               (sum(1 for c in words if c.isupper()) > 0):
                fileout.write(row['STUDY ACCESSION'] + ',' +
                              words.strip() + ',' + str(number[0]) + '\n')
fileout.close()

clean_initial = pd.read_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'Data',
                                         'Catalogue', 'Synthetic',
                                         'new_initial_sample.csv')),
                            encoding='utf-8')
clean_initial_sum = pd.DataFrame(
    clean_initial.groupby(['Cleaned Ancestry']).sum())
clean_initial_sum.rename(
    columns={'Cleaned Ancestry Size': 'Ancestry Sum', }, inplace=True)
clean_initial_count = clean_initial.groupby(['Cleaned Ancestry']).count()
clean_initial_count.rename(
    columns={'Cleaned Ancestry Size': 'Ancestry Count', }, inplace=True)
clean_initial_merged = clean_initial_sum.merge(pd.DataFrame(
    clean_initial_count['Ancestry Count']), how='outer', left_index=True,
    right_index=True)
clean_initial_merged = clean_initial_merged.sort_values(
    by='Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_initial_merged.iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Ancestry Sum']) + \
        ',' + str(row['Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_initial_merged)) +
      ' ancestries found in the \'INITIAL SAMPLE SIZE\' column. ' +
      'These are (number of people used in all studies, ' +
      'number of studies included):\n\n' +
      holder[:-2] + '.\n\n')

clean_replication = pd.read_csv(os.path.abspath(
                                os.path.join('__file__', '../..', 'Data',
                                             'Catalogue', 'Synthetic',
                                             'new_replication_sample.csv')),
                                encoding='utf-8')
clean_replication_sum = pd.DataFrame(
    clean_replication.groupby(['Cleaned Ancestry']).sum())
clean_replication_sum.rename(
    columns={'Cleaned Ancestry Size': 'Ancestry Sum', }, inplace=True)
clean_replication_count = clean_replication.groupby(
    ['Cleaned Ancestry']).count()
clean_replication_count.rename(
    columns={'Cleaned Ancestry Size': 'Ancestry Count', }, inplace=True)
clean_replication_merged = clean_replication_sum.merge(
    pd.DataFrame(clean_replication_count['Ancestry Count']),
    how='outer', left_index=True, right_index=True)
clean_replication_merged = clean_replication_merged.sort_values(
    by='Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_replication_merged.iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Ancestry Sum']) + \
        ',' + str(row['Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_replication_merged)) +
      ' ancestries found in the \'REPLICATION SAMPLE SIZE\' column.' +
      ' These are (number of people used in all studies, number of' +
      ' studies included):\n\n' + holder[:-2] + '.')
There are a total of 207 ancestries found in the 'INITIAL SAMPLE SIZE' column. These are (number of people used in all studies, number of studies included):

European (65506456,5043), British (1409322,31), African American (1365974,613), East Asian (1331034,166), Japanese (1203853,443), Han Chinese (480341,357), Korean (479869,225), Icelandic (418758,4), African (367474,121), Hispanic (344800,197), South Asian (318295,31), Finnish (312291,78), Hispanic/Latino (248416,55), Chinese (247743,152), Indian Asian (181580,44), African American/Afro Caribbean (161018,36), Latin American (149244,30), Indian (127166,48), Hispanic/Latino American (116226,12), Hispanic or Latino (83557,4), European and Asian (79845,2), Dutch (73925,16), Filipino (61239,41), Sardinian (60165,23), Swedish (59442,15), Latino (54928,41), Pomak (53847,31), European American (49345,25), Hispanic/Latin American (48109,7), Erasmus Rucphen (46182,26), Bangladeshi (46178,32), Mylopotamos (45756,31), Malay (41887,24), Mexican (40160,23), German (39536,15), Pakistani (38610,11), Asian (37473,75), Northern European (36459,12), European and East Asian (35524,2), European and South Asian (30482,2), Orcadian (30241,27), Mexican American (28638,39), Asian Indian (26798,9), Ashkenazi Jewish (23674,34), European and Hispanic (22604,1), Old Order Amish (21338,40), Italian (20264,12), Val Borbera (19754,15), French (18714,10), European and Indian Asian (17967,1), Saudi Arab (17030,8), Singaporean Chinese (14920,24), South East Asian (14706,10), Northern Finnish (14289,3), Northwestern European (13483,2), Western European (11043,3), Hispanic American (9444,6), Black (9165,22), African American and African (8298,2), Brazilian (8258,18), Danish (8247,4), Mexican and Latin American (8214,2), Sub Saharan African (7989,9), Turkish (7638,14), Austrian (7407,9), Korcula (7234,10), Puerto Rican (7096,9), Sorbian (6982,8), Mainland Hispanic/Latino (6734,1), European and Erasmus Rucphen (6597,1), Vietnamese (6469,4), Polish (6413,13), Lebanese (6308,6), European Barrett (6167,1), Friuli Venezia Giulia (5791,6), Norwegian (5704,6), Caribbean Hispanic (5607,4), Caribbean 
Hispanic/Latino (5458,1), Gambian (5038,4), Amish (5035,10), Thai (5008,14), Latino or African (4658,9), Punjabi Sikh (4619,5), Southern Chinese (4582,3), Indo European (4238,4), Kosraen (4168,2), Multiple (4088,1), Hutterite (3776,9), Carlantino (3667,10), Nigerian (3564,3), French Canadian (3537,9), American Indian (3446,11), West African (3434,5), Pima Indian (3251,8), Afro Caribbean (3201,4), Taiwanese (3104,6), Samoan (3072,1), African American or European (3054,2), African American/African Caribbean (3015,4), South Indian (2875,2), Native American North American (2835,2), North Indian (2692,4), Sereer (2640,5), South Tyrolean (2426,2), Micronesian (2346,1), Celtic (2307,1), Oceania (2229,2), Cilento (2163,3), Indonesian (2041,5), Sirente (2002,7), Middle Eastern (1950,2), Black Hispanic (1940,4), Native Hawaiian (1902,6), Moroccan (1851,9), Japanese American (1697,3), African American and African Caribbean (1612,1), Surinamese (1553,7), Latino American (1413,2), South Tyrol (1223,2), African and African Arab (1215,4), Tanzanian (1213,1), Sub-saharan African (1212,10), Spanish (1193,4), Sri Lankan Sinhalese (1154,4), Slavic (1102,1), Southern European (1090,1), South Tyrolian (1079,1), Latin/South American (1050,2), Yoruban (1047,3), Split (986,2), African Caribbean (929,2), Ogliastran (897,1), Ethiopian (895,6), Native American South American (875,2), African Arab (853,4), Fruili Venezia Giuli (843,1), Arab (836,6), Mongolian (756,1), Martu Australian Aboriginal (752,3), Central European (744,1), South African (733,2), Uyghur (721,1), Tyrolean (696,1), Mixed (684,1), Thai Chinese (618,2), Papua New Guinean (576,2), Bulgarian (564,2), European American and Mexican American (540,1), Vietnamese Korean (518,1), Costa Rican (506,2), Brazillian (503,1), Jewish (497,2), Isreali Arab (496,6), Middle Eastern/North African (482,4), Korkulan (472,1), Talana (470,1), Taiwanese Han (470,2), Irish (432,2), Malaysian Chinese (371,2), Russian (366,6), Iranian (352,2), Silk 
Road (348,1), Finnish Saami (347,1), Jewish Isreali (331,2), North African (329,7), Plains American Indian (322,1), Fijian Indian (319,2), Hong Kong Chinese (315,1), Norfolk Island (285,2), Greater Middle Eastern (281,2), Japanese Hashimoto (263,1), South African Afrikaner (251,2), Hispanic/Native American (237,1), Tatar (231,2), Southern Indian (229,2), Malawian (226,2), Tunisian (221,2), Portuguese (221,2), Asian American (221,3), Asian or Pacific Islander (207,3), Arab Isreali (189,1), Hispanic and European (180,2), Hispanic/Native (164,1), Bashkir (161,2), Han Chinese American (156,2), Nepalese (99,2), Oriental (90,1), European and African American (89,1), Solomon Islander (85,2), Uygur Chinese (83,1), Native American (82,5), Dai Chinese (58,1), European and Mexican (41,2), Jingpo Chinese (40,1), African and Asian (36,2), Tibetan (35,1), Romanian (32,1), European/Asian (27,2), Caucasian Eastern Mediterranean (26,2), Native Hawaiian or Pacific (14,2), Hui Chinese (11,1), Cape Verdian (4,2), Native American or Alaska Native (4,2), Curacao (2,2), Dominican Republic (2,2), Isreali (2,2), Afghanistan (2,2).


There are a total of 147 ancestries found in the 'REPLICATION SAMPLE SIZE' column. These are (number of people used in all studies, number of studies included):

European (27921979,2804), East Asian (2060628,226), Japanese (1033402,279), Han Chinese (984435,247), African American (629630,312), Chinese (395476,136), Icelandic (309860,7), South Asian (273982,50), Korean (267668,134), European and Asian (136010,2), Asian (127254,48), Hispanic (126837,97), Indian Asian (118255,14), African (117983,57), African American/Afro Caribbean (104015,48), European and African American (99227,9), European and East Asian (87155,5), Finnish (75771,22), Hispanic/Latino (62435,12), Indian (48605,29), Hispanic/Latino American (43200,6), European and Middle East/North African (29807,2), African America/Afro Caribbean (27661,1), Mexican (27148,17), Latino (25299,17), European and Erasmus Rucphen (22789,1), Filipino (21546,23), African American and African (20809,1), Sardinian (20517,12), Dutch (20367,3), Malay (20270,11), Sub Saharan African (20072,22), Black and European (19090,1), Northern European (16962,3), Ashkenazi Jewish (15801,16), European and Indian Asian (13615,1), Asian Indian (12906,7), Indo European (12659,5), Brazilian (12510,12), African American and Afro Caribbean (11583,3), Southern Chinese (11058,3), Old Order Amish (10887,12), Mexican American (10627,22), Punjabi Sikh (10261,5), Thai (9649,23), Turkish (8855,11), African American or Afro Caribbean (8755,2), Hispanic/Latin American (8622,5), European American (8414,2), Erasmus Rucphen (8337,7), Mexican/Latino (8214,2), Saudi Arab (7692,12), American Indian (7039,7), Swedish (6212,3), Pakistani (6115,6), French Canadian (5708,13), Vietnamese (5490,4), Indonesian (5253,7), German (5092,2), Silk Road (5017,11), Southern Indian (4600,2), Middle Eastern (4591,5), West African (4355,6), Pima Indian (4296,7), Danish (4258,2), Singaporean (3968,2), North Indian (3956,4), Russian (3950,2), Singaporean Chinese (3580,2), Gambian (3463,2), Dravidian (3260,4), She Chinese (3235,1), Iranian (3210,9), Northwestern European (2942,2), Black (2895,10), Hispanic / Latino (2826,1), Seychellois 
(2766,12), Talana (2508,3), Amish (2429,4), African American and African Caribbean (2147,1), Arab (2120,13), Samoan (2102,1), European and Ashkenazi Jewish (2035,2), African American and European (1893,1), Polish (1810,9), Costa Rican (1778,3), South and Central American (1769,1), Cilento (1709,2), Hispanic and European (1682,2), North African (1566,10), Polynesian (1536,2), Nepalese (1478,6), Caribbean Hispanic (1390,6), Latin American (1386,6), Oceania (1372,5), Southern European (1296,1), Native Hawaiian (1277,2), Hutterite (1255,2), Moroccan (1247,3), Latino American (1242,2), European and Hispanic (1174,4), Jamaican (1089,2), Orcadian (1035,2), Afro Caribbean (962,1), Friuli Venezia Giulia (957,2), Val Borbera (910,1), Uygur Kazakh Chinese (840,2), Asian American (670,2), Spanish (662,2), Sorbian (659,1), Mongolian (542,2), Portuguese (525,2), Seychelles (492,2), South African Black (481,2), Mayan (475,2), Bulgarian (450,2), Ethiopian (433,6), Malaysian Chinese (420,2), Taiwanese (400,1), Middle Eastern Arab (352,2), Jewish (344,2), Carlantino (322,1), Afro Caribbean and Sub Saharan African (321,1), Surinamese (284,1), Hispanic and Native American (282,1), French (276,1), Western European (253,1), Malaysian (212,2), Arab Isreali (198,2), Japanese Hashimoto (181,1), Tibetan (161,1), Yoruban (146,2), Brazillian (104,1), Uighur (99,2), Ashkenazi Jewish Isreali (96,2), Hispanic American (85,1), Jewish Isreali (74,2), European/Asian (74,2), African America and European (68,1), Italian (45,2), African and Asian (38,2), Aboriginal Canadian (15,2), Native Hawaiian or Pacific (7,1), Black African (5,1), Native American or Alaska Native (4,1), South African (2,1), Colored African (1,1).

Let's put these numbers into tables to get a deeper understanding: first keeping rows in which at least one ancestry is NR, and then dropping all rows in which at least one recorded ancestry is NR:

In [8]:
Broad_Ancestral_Full_initial = pd.DataFrame(
    Cat_Ancestry[Cat_Ancestry.STAGE ==
                 'initial'].groupby(['Broader Ancestral'])['N'].sum())
Broad_Ancestral_Full_initial.rename(
    columns={'N': 'N (Initial)'}, inplace=True)
Broad_Ancestral_Full_replication = pd.DataFrame(
    Cat_Ancestry[Cat_Ancestry.STAGE ==
                 'replication'].groupby(['Broader Ancestral'])['N'].sum())
Broad_Ancestral_Full_replication.rename(
    columns={'N': 'N (Replication)'}, inplace=True)
Broad_Ancestral_Full = pd.merge(Broad_Ancestral_Full_initial,
                                Broad_Ancestral_Full_replication,
                                left_index=True, right_index=True)
Broad_Ancestral_Full['Total'] = Broad_Ancestral_Full['N (Initial)'] + \
    Broad_Ancestral_Full['N (Replication)']
Broad_Ancestral_Full['% Discovery'] = (
    Broad_Ancestral_Full['N (Initial)'] /
    Broad_Ancestral_Full['N (Initial)'].sum()) * 100
Broad_Ancestral_Full['% Replication'] = (
    Broad_Ancestral_Full['N (Replication)'] /
    Broad_Ancestral_Full['N (Replication)'].sum()) * 100
Broad_Ancestral_Full['% Total'] = (Broad_Ancestral_Full['Total'] /
                                   Broad_Ancestral_Full['Total'].sum()) * 100
Broad_Ancestral_Full.to_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'Tables',
                                         'Broad_Ancestral_Full.csv')))
Broad_Ancestral_Full.style
Out[8]:
N (Initial) N (Replication) Total % Discovery % Replication % Total
Broader Ancestral
African 265815 126152 391967 0.319764 0.313267 0.317644
African Am.\Caribbean 1.63833e+06 830852 2.46918e+06 1.97084 2.06321 2.00099
Asian 4.8103e+06 5.42934e+06 1.02396e+07 5.78659 13.4824 8.29805
European 6.92682e+07 2.79912e+07 9.72594e+07 83.3268 69.5091 78.8175
Hispanic\Latin American 1.19485e+06 378374 1.57322e+06 1.43736 0.939597 1.27492
In Part Not Recorded 5.4675e+06 4.68186e+06 1.01494e+07 6.57717 11.6262 8.22488
Other\Mixed 483380 832069 1.31545e+06 0.581486 2.06623 1.06602

Drop the 'in part not recorded' rows (i.e. where the Broad Ancestral Category contains at least one NR):

In [9]:
Broad_Ancestral_NoNR = Broad_Ancestral_Full[['N (Initial)',
                                             'N (Replication)',
                                             'Total']]
Broad_Ancestral_NoNR = Broad_Ancestral_NoNR.drop('In Part Not Recorded')
Broad_Ancestral_NoNR['Total'] = Broad_Ancestral_NoNR['N (Initial)'] + \
    Broad_Ancestral_NoNR['N (Replication)']
Broad_Ancestral_NoNR['% Discovery'] = (
    Broad_Ancestral_NoNR['N (Initial)'] /
    Broad_Ancestral_NoNR['N (Initial)'].sum()) * 100
Broad_Ancestral_NoNR['% Replication'] = (
    Broad_Ancestral_NoNR['N (Replication)'] /
    Broad_Ancestral_NoNR['N (Replication)'].sum()) * 100
Broad_Ancestral_NoNR['% Total'] = (Broad_Ancestral_NoNR['Total'] /
                                   Broad_Ancestral_NoNR['Total'].sum()) * 100
Broad_Ancestral_NoNR.to_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'Tables',
                                         'Broad_Ancestral_NoNR.csv')))
Broad_Ancestral_NoNR.style
Out[9]:
N (Initial) N (Replication) Total % Discovery % Replication % Total
Broader Ancestral
African 265815 126152 391967 0.342276 0.354479 0.346111
African Am.\Caribbean 1.63833e+06 830852 2.46918e+06 2.10959 2.33464 2.18031
Asian 4.8103e+06 5.42934e+06 1.02396e+07 6.19398 15.2561 9.04172
European 6.92682e+07 2.79912e+07 9.72594e+07 89.1932 78.6535 85.8811
Hispanic\Latin American 1.19485e+06 378374 1.57322e+06 1.53855 1.06321 1.38917
Other\Mixed 483380 832069 1.31545e+06 0.622424 2.33806 1.16156

Let's now make some bubble plots to visualise the raw broad ancestry field and the fields we extracted from the free-text strings:

In [10]:
plt.style.use('seaborn-ticks')
plt.rcParams["font.family"] = "Times New Roman"
                                                                         
GroupedBroadAncestries = Cat_Ancestry[
    (Cat_Ancestry['BROAD ANCESTRAL'] != 'Other') &
    (Cat_Ancestry['BROAD ANCESTRAL'].str.count(',') == 0)].groupby(
    ['BROAD ANCESTRAL'])['N'].sum().to_frame()
GroupedBroadAncestries['Percent Individuals'] = (
    GroupedBroadAncestries['N'] / GroupedBroadAncestries['N'].sum()) * 100
GroupedBroadAncestries = GroupedBroadAncestries.sort_values(
    by='Percent Individuals', ascending=False)[0:8]

countriesdict = {'African': '#4daf4a', 'African Am.\Caribbean': '#984ea3',
                 'Asian': '#e41a1c', 'European': '#377eb8',
                 'Hispanic\Latin American': '#ff7f00',
                 'Other/Mixed': '#a65628'}

fig = plt.figure(figsize=(12, 6), dpi=800)
ax = fig.add_subplot(1, 1, 1)
for obs in Cat_Ancestry.index:
    for key, value in countriesdict.items():
        if Cat_Ancestry['Broader Ancestral'][obs].strip() == key:
            ax.plot_date(x=pd.to_datetime(Cat_Ancestry['Dates'][obs]),
                         y=Cat_Ancestry['N'][obs],
                         color=value, marker='.', label='the data',
                         alpha=0.6, markersize=Cat_Ancestry['N'][obs] / 7500)
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=24))
ax.set_xlim(pd.Timestamp('2007-01-01'), pd.Timestamp('2017-12-31'))
ax.set_ylim(0, 500000)
ax.yaxis.tick_right()
ax.yaxis.set_label_position("right")
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
handles, labels = ax.get_legend_handles_labels()

inset_fig = inset_axes(ax, width=3.25, height=1.5, loc=3,
                       bbox_to_anchor=(0.275, .6),
                       bbox_transform=ax.figure.transFigure)
barplot = Broad_Ancestral_NoNR[['% Discovery', '% Replication']].transpose()
barplot = barplot.plot(kind='bar', color=['#4daf4a', '#984ea3',
                                          '#e41a1c', '#377eb8',
                                          '#ff7f00', '#a65628'],
                       alpha=0.6, edgecolor='k', ax=inset_fig, width=0.9)
barplot.set_yticklabels('')
barplot.set_ylabel('')
barplot.set_xlabel('')
barplot.set_ylim(0, 120)
barplot.get_yaxis().set_visible(False)
for p in barplot.patches:
    barplot.annotate(str(round(p.get_height(), 2)) + '%',
                     (p.get_x() + p.get_width() / 2., p.get_height() + 20),
                     ha='center', va='center', rotation=90)
plt.legend(loc='center right', bbox_to_anchor=(0.025, 0.675),
           facecolor='white', edgecolor='black', frameon=True,
           prop={'size': 10})

sns.despine(top=True, left=True, right=False, ax=ax, offset=10)
sns.despine(top=True, right=True, left=True, offset=5, ax=inset_fig)

plt.savefig(os.path.abspath(os.path.join('__file__', '../..', 'Figures', 'svg',
                                         'Ancestral_Plots.svg')),
            bbox_inches='tight')
plt.savefig(os.path.abspath(os.path.join('__file__', '../..', 'Figures', 'png',
                                         'Ancestral_Plots.png')),
            bbox_inches='tight')

We can also analyse how these aggregates change over time, which was a key feature of Popejoy and Fullerton (2016). We can provide a much more granular argument (merely remove '_NoNR' from the cell below to undertake an equivalent analysis including rows which contain some in-part-NR ancestries):

In [11]:
index = [x for x in range(2007, 2018)]
columns = ['European', 'Asian', 'African', 'Hispanic\Latin American',
           'Other\Mixed', 'African Am.\Caribbean']
Cat_Ancestry_NoNR = Cat_Ancestry[
    Cat_Ancestry['Broader Ancestral'] != 'In Part Not Recorded']
Broad_Ancestral_Time_NoNR_PC = pd.DataFrame(index=index, columns=columns)
for year in range(2007, 2018):
    for broaderancestry in Broad_Ancestral_NoNR.index.tolist():
        Broad_Ancestral_Time_NoNR_PC[broaderancestry.strip()][year] =\
            (Cat_Ancestry_NoNR[(
                Cat_Ancestry_NoNR['DATE'].str.contains(str(year))) &
                (Cat_Ancestry_NoNR['Broader Ancestral'] ==
                 broaderancestry)]['N'].sum() /
             Cat_Ancestry_NoNR[
             (Cat_Ancestry_NoNR['DATE'].str.contains(
                 str(year)))]['N'].sum()) * 100
Broad_Ancestral_Time_NoNR_PC.style
Out[11]:
European Asian African Hispanic\Latin American Other\Mixed African Am.\Caribbean
2007 95.4688 2.13984 0.00542792 0.715245 1.18391 0.486807
2008 95.1011 3.06307 0 0.00187221 1.26739 0.56653
2009 88.1708 7.10488 0.258137 0.224117 3.36046 0.881605
2010 86.5405 10.13 0.273146 0.0596777 2.49373 0.502897
2011 78.19 15.8796 0.150837 0.399222 1.71601 3.66433
2012 71.1466 20.0446 0.322268 0.910714 2.95966 4.61619
2013 81.5499 12.0984 0.4095 0.817999 0.641572 4.48268
2014 76.4231 18.7398 0.244282 1.16827 0.991714 2.43283
2015 84.4728 11.8884 0.365494 0.956217 0.65971 1.65735
2016 90.9706 4.54435 0.197934 1.53854 1.29867 1.44988
2017 88.1409 6.29419 0.563915 2.29589 0.518412 2.18666

We can focus on the initial discovery stage to calculate the number of individuals required per ancestry to uncover one hit (i.e. the total sample size divided by the total association count). Note, however, that this requires some distributional assumptions (i.e. that the same diseases are being studied across ancestries).

In [12]:
Cat_Ancestry_Initial = Cat_Ancestry[Cat_Ancestry['STAGE'] == 'initial']

Cat_Ancestry_NoDups = Cat_Ancestry_Initial.drop_duplicates(
    subset=['STUDY ACCESSION'],
    keep=False)[['PUBMEDID', 'STUDY ACCESSION',
                 'BROAD ANCESTRAL',
                 'N']]
Cat_Ancestry_NoDups_merge = pd.merge(Cat_Ancestry_NoDups,
                                     Cat_Studies[['ASSOCIATION COUNT',
                                                  'STUDY ACCESSION',
                                                  'MAPPED_TRAIT']],
                                     how='left', on='STUDY ACCESSION')

listtoiterate = ['European', 'East Asian', 'South Asian',
                 'African Am.\Caribbean', 'Hispanic\Latin American']
for ancestry in listtoiterate:
    temp = Cat_Ancestry_NoDups_merge[Cat_Ancestry_NoDups_merge[
        'BROAD ANCESTRAL'].str.strip() == ancestry]
    print('The number of ' + ancestry + 's required to find one hit: ' +
          str(round(1 /
                    (temp['ASSOCIATION COUNT'].sum() / temp['N'].sum()), 1)))
The number of Europeans required to find one hit: 1443.7
The number of East Asians required to find one hit: 722.4
The number of South Asians required to find one hit: 616.8
The number of African Am.\Caribbeans required to find one hit: 467.9
The number of Hispanic\Latin Americans required to find one hit: 364.9

3.2. Choropleth Map of Country of Recruitment

This is a choropleth map of country of recruitment; note that this field is not fully populated in the Catalogue. The Basemap code is loosely based on this. As we want to study the number of people recruited from each country, we can only utilise values in the 'Country of Recruitment' field when exactly one country is mentioned. For example: if N=100,000 and Country of Recruitment = {U.K., U.S.}, there is no way for us to know the breakdown between the two countries from the Catalogue (especially as the free text field may just contain 'European' ancestry). Our only option in this scenario is to drop such observations. This forces us to drop about 22% of all studies, corresponding to about 48% of the Catalogue as measured by N (because bigger studies have multiple countries of recruitment). We are also faced with multiple entries equal to 'NR', which corresponds to 'not recorded'. Dropping these reduces the number of studies going into the map by a further 21% (of the initial total), meaning the map includes only about half of all GWAS. This has no relationship to whether ancestry is coded.
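The cleaning performed by `make_clean_CoR` is defined elsewhere in the repository; a minimal sketch of the filtering just described might look like the following (the function and column names here are illustrative assumptions, not the helper's exact implementation):

```python
import pandas as pd

def clean_single_country(df):
    # Keep only rows listing exactly one country of recruitment
    # (i.e. no comma-delimited multi-country entries), then drop
    # rows where that single country is 'NR' (not recorded).
    single = df[df['COUNTRY OF RECRUITMENT'].str.count(',') == 0].copy()
    single['Cleaned Country'] = single['COUNTRY OF RECRUITMENT'].str.strip()
    return single[single['Cleaned Country'] != 'NR']

example = pd.DataFrame({'COUNTRY OF RECRUITMENT': ['U.K.', 'U.K., U.S.', 'NR'],
                        'N': [100000, 250000, 5000]})
print(clean_single_country(example)['Cleaned Country'].tolist())  # ['U.K.']
```

Note that both the multi-country row and the 'NR' row are discarded, which is exactly why the mapped N falls so far below the Catalogue total.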

In [13]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Clean_CoR = make_clean_CoR(Cat_Ancestry)
Cleaned_CoR_N = pd.DataFrame(Clean_CoR.groupby(['Cleaned Country'])['N'].sum())
Cleaned_CoR_N['Rank_N'] = Cleaned_CoR_N.rank(
    ascending=False, method='dense').astype(int)
Cleaned_CoR_count = pd.DataFrame(
    Clean_CoR.groupby(['Cleaned Country'])['N'].count())
Cleaned_CoR_count['Rank_count'] = Cleaned_CoR_count.rank(
    ascending=False, method='dense').astype(int)
Cleaned_CoR_count = Cleaned_CoR_count.rename(columns={'N': 'Count'})
Cleaned_CoR = pd.merge(Cleaned_CoR_count, Cleaned_CoR_N,
                       left_index=True, right_index=True)
Cleaned_CoR.index.name = None
Cleaned_CoR['Percent of Count'] = (Cleaned_CoR['Count'] /
                                   Cleaned_CoR['Count'].sum()) * 100
Cleaned_CoR['Percent of N'] = (Cleaned_CoR['N'] /
                               Cleaned_CoR['N'].sum()) * 100
Cleaned_CoR.sort_values('Rank_N', ascending=True)[
    ['Count', 'N', 'Percent of Count', 'Percent of N']].head(10)
Cleaning for single country of recruitment field results in 72.419% of ancestry observations remaining. 
This represents about 54.248% of the total GWAS N. 
When we drop for country==NR, we lose another: 23.63% papers.
In total, we have 5384 observations for this field, out of a total of 9735 rows of Cat_Anc data
Out[13]:
Count N Percent of Count Percent of N
United Kingdom 549 15238916.0 10.196880 41.840677
United States 2292 9231811.0 42.570579 25.347290
Iceland 59 3482621.0 1.095840 9.562046
Japan 350 2032401.0 6.500743 5.580255
China 440 1836353.0 8.172363 5.041976
Korea, South 226 730946.0 4.197623 2.006919
Netherlands 160 645945.0 2.971768 1.773537
Finland 134 570202.0 2.488856 1.565573
Germany 164 407208.0 3.046062 1.118049
Australia 98 290948.0 1.820208 0.798840
In [14]:
bins = np.linspace(1, len(Cleaned_CoR['Rank_N'].unique()),
                   len(Cleaned_CoR['Rank_N'].unique()))
cm = plt.get_cmap('OrRd')
scheme = [cm(i / 10) for i in range(10)]
Cleaned_CoR['scheme'] = [
    cm(Cleaned_CoR['N'][i] /
       Cleaned_CoR['N'].max()) for i in Cleaned_CoR.index]
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(111, facecolor='#deeff5', frame_on=False)
scheme = [cm(i / len(Cleaned_CoR)) for i in range(len(Cleaned_CoR))]
m = Basemap(lon_0=0, projection='robin', resolution='i')
m.drawmapboundary(color='k', linewidth=0.075)
m.drawcountries(color='k', linewidth=0.025)
m.drawmeridians(np.arange(0, 360, 60), labels=[False, False, False, True],
                color='#bdbdbd', dashes=[6, 6], linewidth=0.1, fontsize=10)
m.drawparallels(np.arange(-90, 90, 30), labels=[True, False, False, False],
                color='#bdbdbd', dashes=[6, 6], linewidth=0.1, fontsize=10)
m.readshapefile(os.path.abspath(os.path.join('__file__', '../..', 'Data',
                                             'ShapeFiles',
                                             'ne_10m_admin_0_countries')),
                'units', color='#444444', linewidth=.075)
for info, shape in zip(m.units_info, m.units):
    country = info['NAME_CIAWF']
    if country not in Cleaned_CoR.index:
        color = 'w'
    else:
        color = Cleaned_CoR.loc[country]['scheme']
    patches = [Polygon(np.array(shape), True)]
    pc = PatchCollection(patches)
    pc.set_facecolor(color)
    ax.add_collection(pc)
norm = Normalize()
mapper = mpl.cm.ScalarMappable(norm=norm, cmap=cm)
mapper.set_array(Cleaned_CoR['N'] / 1000000)
clb = plt.colorbar(mapper, shrink=0.75)
clb.ax.tick_params(labelsize=10)
clb.ax.set_title('Number of\n People (m)', y=1.025, fontsize=12)
plt.savefig(os.path.abspath(os.path.join('__file__', '../..', 'Figures', 'svg',
                                         'Country_Rec_N.svg')),
            bbox_inches='tight')
plt.show()

Let's now merge that with a country-to-continent lookup file built using data from within the shapefile itself (based on CIAWF).

In [15]:
countrylookup = pd.read_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'Data',
                                         'ShapeFiles', 'Country_Lookup.csv')),
                            index_col='Country')
country_merged = pd.merge(countrylookup, Cleaned_CoR,
                          left_index=True, right_index=True)
country_merged_sumcount = country_merged[[
    'Continent', 'Count']].groupby(['Continent']).sum()
country_merged_sumN = country_merged[[
    'Continent', 'N']].groupby(['Continent']).sum()
country_merged_sums = pd.merge(
    country_merged_sumN, country_merged_sumcount,
    left_index=True, right_index=True)
country_merged_sums['N (%)'] = (
    country_merged_sums['N'] / country_merged_sums['N'].sum()) * 100
country_merged_sums['Count (%)'] = (
    country_merged_sums['Count'] / country_merged_sums['Count'].sum()) * 100
country_merged_sums.to_csv(os.path.abspath(
                           os.path.join('__file__', '../..', 'Tables',
                                        'ContinentOfRecruitment.csv')))
country_merged_sums
Out[15]:
N Count N (%) Count (%)
Continent
Africa 39742.0 43 0.109117 0.798663
Asia 5180891.0 1298 14.224895 24.108470
Europe 21529123.0 1528 59.111362 28.380386
North America 9345005.0 2380 25.658081 44.205052
Oceania 302888.0 107 0.831623 1.987370
Seven seas (open ocean) 1389.0 1 0.003814 0.018574
South America 22256.0 27 0.061107 0.501486

4. EFO and Parent Traits

4.1 What EFO and Parent traits get studied?

Let's quickly see the effect of splitting or dropping delimited entries in the trait field:

In [16]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)
print('There are ' + str(len(EFO_Parent_Mapped)) +
      ' rows of EFO terms after we split for commas')
print('This indicates ' + str(len(EFO_Parent_Mapped) -
                              len(EFO_Parent_Mapped_NoCommas)) +
      ' additional terms were mapped to' +
      ' Parents compared to when we drop comma-separated values')
print('This indicates ' +
      str(len(EFO_Parent_Mapped['EFO term'].drop_duplicates())) +
      ' unique EFO terms to map to Parents')
print('This is in comparison to ' +
      str(len(EFO_Parent_Mapped_NoCommas['EFO term'].drop_duplicates())) +
      ' unique EFO terms in Cat_Studies')
There are 6623 rows of EFO terms after we split for commas
This indicates 2972 additional terms were mapped to Parents compared to when we drop comma-separated values
This indicates 1797 unique EFO terms to map to Parents
This is in comparison to 1246 unique EFO terms in Cat_Studies
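The expansion performed inside `EFO_parent_mapper` can be sketched as follows (a minimal illustration with hypothetical rows; the real helper also joins each split term onto its EFO parent category). It assumes pandas 0.25+ for `DataFrame.explode`:

```python
import pandas as pd

# Two hypothetical studies: one single-trait, one with a comma-delimited
# trait field that should yield two rows after splitting.
df = pd.DataFrame({'STUDY ACCESSION': ['GCST0001', 'GCST0002'],
                   'MAPPED_TRAIT': ['body mass index',
                                    'type II diabetes mellitus, obesity']})

# Split the delimited field into lists, then give each term its own row.
split_rows = (df.assign(**{'EFO term': df['MAPPED_TRAIT'].str.split(', ')})
                .explode('EFO term'))

# The alternative: simply drop any row containing a delimiter.
no_commas = df[df['MAPPED_TRAIT'].str.count(',') == 0]
print(len(split_rows), len(no_commas))  # 3 1
```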

Let's plot this out in terms of frequency of study, the number of people involved, and the number of associations found:

In [17]:
plt.style.use('seaborn-ticks')
plt.rcParams['font.family'] = 'Times New Roman'

MappedTrait_df_Count = EFO_Parent_Mapped.groupby('EFO term')['EFO term'].count(
).to_frame().rename(columns={'EFO term': 'Number of Studies'}).reset_index()
MappedTrait_df_AncSum = EFO_Parent_Mapped.groupby('EFO term')['N'].sum(
).to_frame().rename(columns={'N': 'Total Sample'}).reset_index()
MappedTrait_df_AssSum = EFO_Parent_Mapped.groupby(
    'EFO term')['ASSOCIATION COUNT'].sum().to_frame().rename(columns={
        'ASSOCIATION COUNT': 'Total Associations'}).reset_index()
MappedTrait_df_toplot = pd.merge(
    MappedTrait_df_AssSum, MappedTrait_df_AncSum, how='left', on='EFO term')
MappedTrait_df_toplot = pd.merge(
    MappedTrait_df_toplot, MappedTrait_df_Count, how='left', on='EFO term')
MappedTrait_df_toplot['Studies'] = (
    MappedTrait_df_toplot['Number of Studies'] /
    MappedTrait_df_toplot['Number of Studies'].sum()) * 100
MappedTrait_df_toplot['Associations'] = (
    MappedTrait_df_toplot['Total Associations'] /
    MappedTrait_df_toplot['Total Associations'].sum()) * 100
MappedTrait_df_toplot['Sample'] = (
    MappedTrait_df_toplot['Total Sample'] /
    MappedTrait_df_toplot['Total Sample'].sum()) * 100
MappedTrait_df_toplot = MappedTrait_df_toplot.set_index('EFO term')
MappedTrait_df_toplot = MappedTrait_df_toplot.sort_values(
    by='Studies', ascending=False)[0:17]
MappedTrait_df_toplot = MappedTrait_df_toplot[[
    'Sample', 'Studies', 'Associations']]

EFOParent_df_Count = EFO_Parent_Mapped.groupby(
    'Parent term')['Parent term'].count(
).to_frame().rename(columns={'Parent term': 'Number of Studies'}).reset_index()
EFOParent_df_AncSum = EFO_Parent_Mapped.groupby('Parent term')['N'].sum(
).to_frame().rename(columns={'N': 'Total Sample'}).reset_index()
EFOParent_df_AssSum = EFO_Parent_Mapped.groupby(
    'Parent term')['ASSOCIATION COUNT'].sum().to_frame().rename(
    columns={'ASSOCIATION COUNT': 'Total Associations'}).reset_index()
EFOParent_df_toplot = pd.merge(
    EFOParent_df_AssSum, EFOParent_df_AncSum, how='left', on='Parent term')
EFOParent_df_toplot = pd.merge(
    EFOParent_df_toplot, EFOParent_df_Count, how='left', on='Parent term')
EFOParent_df_toplot['Studies'] = (
    EFOParent_df_toplot['Number of Studies'] /
    EFOParent_df_toplot['Number of Studies'].sum()) * 100
EFOParent_df_toplot['Associations'] = (
    EFOParent_df_toplot['Total Associations'] /
    EFOParent_df_toplot['Total Associations'].sum()) * 100
EFOParent_df_toplot['Sample'] = (
    EFOParent_df_toplot['Total Sample'] /
    EFOParent_df_toplot['Total Sample'].sum()) * 100
EFOParent_df_toplot = EFOParent_df_toplot.set_index('Parent term')
EFOParent_df_toplot = EFOParent_df_toplot.sort_values(
    by='Studies', ascending=False)
EFOParent_df_toplot = EFOParent_df_toplot[[
    'Sample', 'Studies', 'Associations']]

fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(18, 7.5))
fig_a = MappedTrait_df_toplot.sort_values(by='Studies', ascending=True).plot(
    kind='barh', ax=axes[0], legend=False, color=['goldenrod', 'steelblue',
                                                  'tomato'], edgecolor='k',
    width=0.66)
fig_a.set_ylabel('')
fig_a.set_xlabel('Percent of Mapped Traits (%)', fontsize=12)
fig_b = EFOParent_df_toplot.sort_values(by='Studies', ascending=True).plot(
    kind='barh', ax=axes[1], legend=True, color=['goldenrod',
                                                 'steelblue', 'tomato'],
    edgecolor='k', width=0.66)
fig_b.set_ylabel('')
fig_b.set_xlabel('Percent of Parent Traits (%)', fontsize=12)
fig_b.legend(prop={'size': 12}, frameon=True,
             loc='lower right', bbox_to_anchor=(1, 0.2), edgecolor='k')
plt.tight_layout()
sns.despine()
plt.savefig(os.path.abspath(os.path.join('__file__', '../..', 'Figures', 'svg',
                                         'Diseases_Traits_EFO_Bars.svg')),
            bbox_inches='tight')

Let's now decompose this across the seven different 'broader ancestral categories' we defined above:

In [18]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)
Parent_Mapped_broaderancestry = pd.merge(
    EFO_Parent_Mapped[['STUDY ACCESSION', 'Parent term']],
    Cat_Ancestry[['STUDY ACCESSION', 'Broader Ancestral', 'N']],
    how='left', on='STUDY ACCESSION')

Parent_broaderancestry_groupedby = Parent_Mapped_broaderancestry.groupby(
    ['Broader Ancestral', 'Parent term'], as_index=False)['N'].sum()
for ancestry in Parent_broaderancestry_groupedby['Broader Ancestral'].unique():
    holdstring = ('The order of the studied Parent terms for the ' +
                  ancestry + ' category are: ')
    tempdf = Parent_broaderancestry_groupedby[
        Parent_broaderancestry_groupedby['Broader Ancestral'] ==
        ancestry].copy()
    tempdf['Percent of Total'] = (tempdf['N'] / tempdf['N'].sum()) * 100
    tempdf = tempdf.sort_values(by='N', ascending=False)
    for index, row in tempdf.iterrows():
        holdstring = (holdstring + row['Parent term'] + ' (' +
                      str(round(row['Percent of Total'], 2)) + '%) ; ')
    print(holdstring[:-2] + '\n')

EFO_Mapped_broaderancestry = pd.merge(
    EFO_Parent_Mapped[['STUDY ACCESSION', 'EFO term']],
    Cat_Ancestry[['STUDY ACCESSION', 'Broader Ancestral', 'N']],
    how='left', on='STUDY ACCESSION')
EFO_broaderancestry_groupedby = EFO_Mapped_broaderancestry.groupby(
    ['Broader Ancestral', 'EFO term'], as_index=False)['N'].sum()
for ancestry in EFO_broaderancestry_groupedby['Broader Ancestral'].unique():
    holdstring = ('The order of the studied EFO terms for the ' +
                  ancestry + ' category are: ')
    tempdf = EFO_broaderancestry_groupedby[
        EFO_broaderancestry_groupedby['Broader Ancestral'] ==
        ancestry].copy()
    tempdf['Percent of Total'] = (tempdf['N'] / tempdf['N'].sum()) * 100
    tempdf = tempdf.sort_values(by='N', ascending=False)
    for index, row in tempdf[0:20].iterrows():
        holdstring = (holdstring + row['EFO term'] + ' (' +
                      str(round(row['Percent of Total'], 2)) + '%) ; ')
    print(holdstring[:-2] + '\n')
The order of the studied Parent terms for the African  category are: Other Meas. (42.44%) ; Other Disease (19.36%) ; Cardiovascular Meas. (13.86%) ; Cardiovascular Disease (10.47%) ; Body Meas. (4.36%) ; Cancer (3.2%) ; Neurological Disorder (1.24%) ; Response To Drug (1.24%) ; Hematological Meas. (0.84%) ; Lipid/Lipoprotein Meas. (0.8%) ; Other Trait (0.61%) ; Biological Process (0.54%) ; Digestive System Disorder (0.46%) ; Immune System Disorder (0.45%) ; Inflammatory Meas. (0.13%) 

The order of the studied Parent terms for the African Am.\Caribbean category are: Body Meas. (23.97%) ; Other Meas. (21.75%) ; Hematological Meas. (11.65%) ; Cardiovascular Meas. (6.85%) ; Biological Process (6.82%) ; Cancer (4.55%) ; Cardiovascular Disease (4.15%) ; Other Disease (4.03%) ; Response To Drug (3.68%) ; Neurological Disorder (3.58%) ; Lipid/Lipoprotein Meas. (3.06%) ; Metabolic Disorder (2.34%) ; Other Trait (1.55%) ; Digestive System Disorder (1.31%) ; Inflammatory Meas. (0.39%) ; Immune System Disorder (0.31%) ; Liver Enzyme Meas. (0.0%) 

The order of the studied Parent terms for the Asian  category are: Other Meas. (22.22%) ; Body Meas. (11.09%) ; Lipid/Lipoprotein Meas. (8.29%) ; Cardiovascular Disease (7.88%) ; Metabolic Disorder (7.0%) ; Other Disease (6.91%) ; Cancer (6.69%) ; Digestive System Disorder (6.02%) ; Immune System Disorder (4.96%) ; Neurological Disorder (4.43%) ; Hematological Meas. (4.03%) ; Cardiovascular Meas. (3.89%) ; Other Trait (2.46%) ; Biological Process (1.78%) ; Liver Enzyme Meas. (1.54%) ; Response To Drug (0.56%) ; Inflammatory Meas. (0.26%) 

The order of the studied Parent terms for the European  category are: Other Meas. (26.78%) ; Body Meas. (10.42%) ; Neurological Disorder (9.78%) ; Cancer (9.61%) ; Hematological Meas. (6.37%) ; Biological Process (5.87%) ; Cardiovascular Disease (5.42%) ; Other Trait (4.15%) ; Other Disease (3.58%) ; Lipid/Lipoprotein Meas. (3.43%) ; Cardiovascular Meas. (3.4%) ; Immune System Disorder (2.9%) ; Digestive System Disorder (2.45%) ; Response To Drug (2.37%) ; Metabolic Disorder (2.35%) ; Inflammatory Meas. (0.87%) ; Liver Enzyme Meas. (0.26%) 

The order of the studied Parent terms for the Hispanic\Latin American category are: Other Meas. (30.09%) ; Cardiovascular Meas. (12.83%) ; Hematological Meas. (11.51%) ; Response To Drug (10.78%) ; Other Disease (6.18%) ; Body Meas. (5.31%) ; Lipid/Lipoprotein Meas. (5.09%) ; Other Trait (4.56%) ; Biological Process (3.24%) ; Cardiovascular Disease (2.82%) ; Metabolic Disorder (2.11%) ; Cancer (1.91%) ; Inflammatory Meas. (1.42%) ; Neurological Disorder (0.95%) ; Digestive System Disorder (0.79%) ; Immune System Disorder (0.32%) ; Liver Enzyme Meas. (0.08%) 

The order of the studied Parent terms for the In Part Not Recorded category are: Cardiovascular Disease (27.15%) ; Other Meas. (18.93%) ; Neurological Disorder (13.82%) ; Other Trait (7.21%) ; Body Meas. (6.51%) ; Lipid/Lipoprotein Meas. (5.24%) ; Inflammatory Meas. (4.48%) ; Cancer (2.95%) ; Metabolic Disorder (2.35%) ; Immune System Disorder (2.3%) ; Biological Process (2.29%) ; Cardiovascular Meas. (1.9%) ; Other Disease (1.78%) ; Digestive System Disorder (1.37%) ; Response To Drug (0.78%) ; Hematological Meas. (0.51%) ; Liver Enzyme Meas. (0.42%) 

The order of the studied Parent terms for the Other/Mixed category are: Neurological Disorder (20.76%) ; Other Meas. (15.64%) ; Other Trait (11.97%) ; Cardiovascular Disease (9.21%) ; Cancer (8.8%) ; Lipid/Lipoprotein Meas. (6.75%) ; Hematological Meas. (6.26%) ; Other Disease (6.21%) ; Body Meas. (4.87%) ; Metabolic Disorder (2.71%) ; Inflammatory Meas. (2.12%) ; Immune System Disorder (1.68%) ; Response To Drug (1.6%) ; Biological Process (0.65%) ; Digestive System Disorder (0.51%) ; Cardiovascular Meas. (0.27%) ; Liver Enzyme Meas. (0.0%) 

The order of the studied EFO terms for the African  category are: Systolic Blood Pressure (12.78%) ; Pulse Pressure Meas. (12.78%) ; Diastolic Blood Pressure (12.78%) ; Hypertension (10.33%) ; Malaria (9.33%) ; Serum Creatinine Meas. (6.07%) ; Glomerular Filtration Rate (6.02%) ; Tuberculosis (5.43%) ; Body Mass Index (3.86%) ; Prostate Carcinoma (2.59%) ; Blood Pressure (1.6%) ; Bacteriemia (1.08%) ; Pneumococcal Bacteremia (0.82%) ; LD LC measurement (0.8%) ; Herpes Zoster (0.63%) ; Hiv-1 Infection (0.54%) ; Sickle Cell Anemia (0.51%) ; Bnp Meas. (0.5%) ; Fetal Hemoglobin Meas. (0.49%) ; Attempted Suicide (0.47%) 

The order of the studied EFO terms for the African Am.\Caribbean category are: Body Mass Index (10.2%) ; Smoking Behavior (5.87%) ; Physical Activity Meas. (4.3%) ; Bmi-Adjusted Waist-Hip Ratio (3.86%) ; Bmi-Adjusted Waist Circumference (3.24%) ; Waist-Hip Ratio (2.58%) ; Type II Diabetes Mellitus (1.95%) ; Hypertension (1.56%) ; Body Height (1.55%) ; Blood Pressure (1.55%) ; Waist Circumference (1.53%) ; Hematocrit (1.51%) ; Hemoglobin Meas. (1.51%) ; Erythrocyte Count (1.51%) ; Breast Carcinoma (1.42%) ; Qt Interval (1.24%) ; Leukocyte Count (1.23%) ; Mean Corpuscular Volume (1.21%) ; Prostate Carcinoma (1.2%) ; Smoking Behaviour Meas. (1.18%) 

The order of the studied EFO terms for the Asian  category are: Type II Diabetes Mellitus (6.53%) ; Body Mass Index (5.21%) ; Systolic Blood Pressure (2.67%) ; Diastolic Blood Pressure (2.67%) ; Hypertension (2.3%) ; HD LC measurement (2.26%) ; Breast Carcinoma (2.23%) ; Atrial Fibrillation (2.21%) ; Triglyceride Meas. (2.15%) ; Renal System Meas. (1.82%) ; Colorectal Cancer (1.81%) ; LD LC measurement (1.8%) ; Blood Pressure (1.77%) ; Rheumatoid Arthritis (1.56%) ; Pulse Pressure Meas. (1.51%) ; Total Cholesterol Meas. (1.47%) ; Body Height (1.34%) ; Bmi-Adjusted Waist-Hip Ratio (1.29%) ; Mean Arterial Pressure (1.27%) ; Bmi-Adjusted Waist Circumference (1.26%) 

The order of the studied EFO terms for the European  category are: Body Mass Index (4.44%) ; Smoking Behavior (2.85%) ; Unipolar Depression (2.23%) ; Bmi-Adjusted Waist-Hip Ratio (1.81%) ; Physical Activity Meas. (1.76%) ; Bmi-Adjusted Waist Circumference (1.65%) ; Ischemic Stroke (1.61%) ; Type II Diabetes Mellitus (1.43%) ; Breast Carcinoma (1.36%) ; Mood Disorder (1.29%) ; Atrial Fibrillation (1.22%) ; Diastolic Blood Pressure (1.16%) ; Systolic Blood Pressure (1.15%) ; Self Reported Educational Attainment (1.06%) ; Basal Cell Carcinoma (0.91%) ; Neuroticism Meas. (0.84%) ; Triglyceride Meas. (0.84%) ; Parkinson'S Disease (0.82%) ; HD LC measurement (0.81%) ; LD LC measurement (0.81%) 

The order of the studied EFO terms for the Hispanic\Latin American category are: Qt Interval (4.84%) ; Response To Tricyclic Antidepressant (3.92%) ; Response To Tetracyclic Antidepressant (3.92%) ; Heart Rate Variability Meas. (2.34%) ; Sleep Apnea Meas. (2.25%) ; Body Mass Index (2.22%) ; Smoking Behavior (2.21%) ; Response To Sulfonylurea (2.13%) ; Type II Diabetes Mellitus (1.92%) ; Systolic Blood Pressure (1.91%) ; Diastolic Blood Pressure (1.91%) ; Pulse Pressure Meas. (1.86%) ; Obstructive Sleep Apnea (1.66%) ; Asthma (1.6%) ; HD LC measurement (1.51%) ; Triglyceride Meas. (1.48%) ; Hypertension (1.47%) ; Total Cholesterol Meas. (1.14%) ; Smoking Behaviour Meas. (1.08%) ; Leukocyte Count (1.06%) 

The order of the studied EFO terms for the In Part Not Recorded category are: Coronary Artery Disease (15.44%) ; Schizophrenia (7.1%) ; C-Reactive Protein Meas. (4.46%) ; Coronary Heart Disease (2.62%) ; Coronary Artery Bypass (2.25%) ; Myocardial Infarction (2.24%) ; Percutaneous Transluminal Coronary Angioplasty (2.23%) ; Ischemic Cardiomyopathy (2.23%) ; Angina Pectoris (2.23%) ; Spine Bone Mineral Density (2.2%) ; Type II Diabetes Mellitus (1.65%) ; Migraine Disorder (1.56%) ; Systolic Blood Pressure (1.54%) ; Diastolic Blood Pressure (1.54%) ; Waist Circumference (1.52%) ; Bmi-Adjusted Waist Circumference (1.5%) ; Pulse Pressure Meas. (1.5%) ; Waist-Hip Ratio (1.39%) ; Bmi-Adjusted Waist-Hip Ratio (1.38%) ; Bipolar Disorder (1.32%) 

The order of the studied EFO terms for the Other/Mixed category are: Ischemic Stroke (9.46%) ; Parkinson'S Disease (6.44%) ; Schizophrenia (5.75%) ; Total Cholesterol Meas. (5.22%) ; Amyotrophic Lateral Sclerosis (5.22%) ; Bone Density (3.89%) ; Asthma (3.37%) ; Prostate Carcinoma (3.27%) ; Body Mass Index (2.64%) ; Cardiac Embolism (2.61%) ; Coronary Artery Disease (2.44%) ; Intraocular Pressure Meas. (2.33%) ; Type II Diabetes Mellitus (2.13%) ; C-Reactive Protein Meas. (2.12%) ; Melanoma (1.83%) ; Alzheimers Disease (1.75%) ; Sex Interaction Meas. (1.52%) ; Body Height (1.43%) ; Chronic Obstructive Pulmonary Disease (1.28%) ; Sudden Cardiac Arrest (1.2%) 

5. Who funds GWAS and what do they fund?

5.1 A simple descriptive tabulation

Let's do a simple tabulation of the Top 20 GWAS funders, measured by 'GWAS contributions': i.e. one instance of a funder funding an author's time on a paper.
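The tabulation in the next cell amounts to counting acknowledgements per agency and expressing each count as a share of the total. A minimal stdlib sketch of the same idea, using made-up agency rows in place of the real `FunderInfo` table:

```python
from collections import Counter

# Hypothetical grant acknowledgements: one entry per (paper, funder) pair
acknowledgements = ['NHLBI', 'MRC', 'NHLBI', 'Wellcome Trust', 'NHLBI', 'MRC']

counts = Counter(acknowledgements)
total = sum(counts.values())
# (agency, contributions, % of total), sorted by contributions descending
table = [(agency, n, round(100 * n / total, 2))
         for agency, n in counts.most_common()]
# table[0] → ('NHLBI', 3, 50.0)
```

The pandas version below does the same via `groupby(...).count()`, with the country of each agency merged back in afterwards.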

In [19]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
AllFunders = FunderInfo.groupby(by='Agency').count()
AllFunders.index.name = None
AllFunders = AllFunders.reset_index()
AllFunders = AllFunders.rename(columns={
                               'index': 'Agency',
                               'PUBMEDID': 'Grant Contributions',
                               'GrantCountry': 'Country'})
AllFunders_withcountries = pd.merge(AllFunders[['Agency',
                                                'Grant Contributions']],
                                    FunderInfo[['Agency',
                                                'GrantCountry']].drop_duplicates(
                                        'Agency'),
                                    on='Agency', how='left')
AllFunders_withcountries = AllFunders_withcountries.set_index('Agency')
AllFunders_withcountries.index.name = None
AllFunders_withcountries['% of Total'] = round((
    AllFunders_withcountries['Grant Contributions'] /
    AllFunders_withcountries['Grant Contributions'].sum()) * 100, 2)
AllFunders_withcountries.sort_values(
    'Grant Contributions', ascending=False)[0:20]
Out[19]:
Grant Contributions GrantCountry % of Total
NHLBI NIH HHS 12388 United States 27.23
NCI NIH HHS 5013 United States 11.02
NIA NIH HHS 3872 United States 8.51
MRC 2913 United Kingdom 6.40
NIMH NIH HHS 2593 United States 5.70
NIDDK NIH HHS 2470 United States 5.43
NHGRI NIH HHS 1718 United States 3.78
Wellcome Trust 1608 United Kingdom 3.53
NCRR NIH HHS 1527 United States 3.36
PHS HHS 1137 United States 2.50
NIAID NIH HHS 1028 United States 2.26
NIAMS NIH HHS 938 United States 2.06
NIAAA NIH HHS 886 United States 1.95
NINDS NIH HHS 723 United States 1.59
NIDA NIH HHS 682 United States 1.50
NCATS NIH HHS 594 United States 1.31
Intramural NIH HHS 573 United States 1.26
Cancer Research UK 564 United Kingdom 1.24
NIGMS NIH HHS 547 United States 1.20
NEI NIH HHS 434 United States 0.95

Let's print some simple descriptives from this data:

In [20]:
print('There are ' + str(len(FunderInfo['Agency'].drop_duplicates())) +
      ' unique funders returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantID'].drop_duplicates())) +
      ' unique grants returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantCountry'].drop_duplicates())) +
      ' unique grant countries returned from PubMed Central.')
print('Each study has an average of ' +
      str(round(len(FunderInfo) / len(id_list), 2)) + ' grants funding it.')
grantgroupby = FunderInfo.groupby(['Agency', 'GrantID']).size().groupby(
    level=1).max().sort_values(ascending=False).reset_index()
print('The most frequently acknowledged grant is GrantID ' +
      grantgroupby['GrantID'][0] + ' which is acknowledged ' +
      str(grantgroupby[0][0]) + ' times.')
There are 87 unique funders returned from PubMed Central.
There are 12167 unique grants returned from PubMed Central.
There are 7 unique grant countries returned from PubMed Central.
Each study has an average of 13.73 grants funding it.
The most frequently acknowledged grant is GrantID P30 DK063491 which is acknowledged 182 times.

5.2 International Distribution of Funders

From which countries do these grants come?

In [21]:
TopCountryFunders = FunderInfo.groupby(by='GrantCountry').count()
TopCountryFunders = TopCountryFunders.rename(
    columns={'PUBMEDID': 'Number Of Studies'})
TopCountryFunders = TopCountryFunders.sort_values(
    'Number Of Studies', ascending=False)[['Number Of Studies']]
TopCountryFunders = TopCountryFunders.reset_index().rename(
    columns={'GrantCountry': 'Country of Funder'})
TopCountryFunders['Percent Funded (%)'] = (
    TopCountryFunders['Number Of Studies'] /
    TopCountryFunders['Number Of Studies'].sum()) * 100
TopCountryFunders = TopCountryFunders.set_index('Country of Funder')
TopCountryFunders.index.name = None
TopCountryFunders.style
Out[21]:
Number Of Studies Percent Funded (%)
United States 39273 86.3162
United Kingdom 6006 13.2003
Canada 165 0.362645
International 38 0.0835183
Austria 6 0.0131871
None 6 0.0131871
Italy 5 0.0109893

5.3 What do they fund?

In [22]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)

EFO_Parent_Mapped['Parent term'] = \
    EFO_Parent_Mapped['Parent term'].str.replace('Disorder', '')

FunderInfo_Parent = pd.merge(FunderInfo, EFO_Parent_Mapped,
                             left_on='PUBMEDID', right_on='PUBMEDID',
                             how='left')

funder_parent = pd.DataFrame(
    index=FunderInfo.groupby(
        ['Agency'])['Agency'].count().sort_values(ascending=False).
    index.values[0:20].tolist(),
    columns=FunderInfo_Parent.groupby(['Parent term'])['Parent term'].count().
    sort_values(ascending=False).index.values.tolist())

for parent in FunderInfo_Parent.groupby(['Parent term'])['Parent term'].count().\
        sort_values(ascending=False).index.values.tolist():
    for funder in FunderInfo.groupby(['Agency'])['Agency'].count().sort_values(
            ascending=False).index.values[0:20].tolist():
        funder_parent[parent][funder] = len(
            FunderInfo_Parent[(FunderInfo_Parent['Agency'] == funder) & (
                FunderInfo_Parent['Parent term'] == parent)])
FunderInfo_Ancestry = pd.merge(FunderInfo, Cat_Ancestry, left_on='PUBMEDID',
                               right_on='PUBMEDID', how='left')

funder_ancestry = pd.DataFrame(index=FunderInfo.groupby(
    ['Agency'])['Agency'].count().sort_values(ascending=False).
    index.values[0:20].tolist(),
    columns=FunderInfo_Ancestry.groupby(
    ['Broader Ancestral'])['Broader Ancestral'].count().
    sort_values(ascending=False).index.values.tolist())

for ancestry in FunderInfo_Ancestry.groupby(['Broader Ancestral'])['Broader Ancestral'].count().\
        sort_values(ascending=False).index.values.tolist():
    for funder in FunderInfo.groupby(['Agency'])['Agency'].count().sort_values(
            ascending=False).index.values[0:20].tolist():
        funder_ancestry[ancestry][funder] = len(
            FunderInfo_Ancestry[(FunderInfo_Ancestry['Agency'] == funder) & (
                FunderInfo_Ancestry['Broader Ancestral'] == ancestry)])

sns.set(font_scale=8, font='Times New Roman', style='white')
f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharex=False,
                             sharey=True, figsize=(18, 11),
                             gridspec_kw={'width_ratios': [.25, .7],
                                          'wspace': 0, 'hspace': 0})

gg = sns.heatmap(funder_ancestry.astype(float), ax=ax1, fmt='.0f',
                 annot=True,  cmap='Oranges', xticklabels=True,
                 yticklabels=True, linewidth=2, robust=True, cbar=False,
                 annot_kws={'size': 10})
gg.tick_params(axis='both', which='major', labelsize=12)
hh = sns.heatmap(funder_parent.astype(float), ax=ax2, fmt='.0f',
                 annot=True, cmap='Blues', xticklabels=True,
                 yticklabels=True, linewidth=2, robust=True,
                 cbar=False, annot_kws={'size': 10})
hh.tick_params(axis='both', which='major', labelsize=12)

plt.gcf()
plt.setp(ax2.get_yticklabels(), visible=False)
plt.tight_layout()
plt.savefig(os.path.abspath(os.path.join('__file__', '../..', 'Figures', 'svg',
                                         'Funder_Heatmap.svg')), bbox_inches='tight')
sns.set(font_scale=1, font='Times New Roman')

6. Who are the Authors?

Who are the most cited authors? What is their 'GWAS H-Index', calculated using only their papers in the GWAS Catalog? Note that we assume each unique forename + surname combination is a distinct author. First, a couple of snippets:

In [23]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
print('There are a total of ' + str(len(AuthorMaster)) +
      ' "authorship contributions".')
print('These contributions are made by ' +
      str(len(AuthorMaster['AUTHORNAME'].unique())) + ' unique authors.')
print('There are a total of ' +
      str(len(AuthorMaster) /
          len(AuthorMaster['PUBMEDID'].drop_duplicates())) +
      ' "authorship contributions per paper".')
print('The study with the most number of authors has ' +
      str(AuthorMaster.groupby(['PUBMEDID']).size().max()) +
      ' authors (PubMed ID: ' +
      str(AuthorMaster.groupby(['PUBMEDID']).size().idxmax()) + ')')
There are a total of 110659 "authorship contributions".
These contributions are made by 37428 unique authors.
There are a total of 33.553365676167374 "authorship contributions per paper".
The study with the most number of authors has 559 authors (PubMed ID: 22479202)

6.1 Calculate 'GWAS-H' indexes

Let's then calculate our GWAS-H indexes using citation data and paper counts for unique authors.
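The loop in the next cell ranks each author's papers by citation count. As a compact point of reference, here is a minimal stand-alone sketch of the standard h-index definition (the citation counts are made up):

```python
def h_index(citations):
    # Standard definition: the largest h such that the author has
    # h papers with at least h citations each.
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

h_index([10, 8, 5, 4, 3])  # → 4
```

For instance, citation counts of [10, 8, 5, 4, 3] give an index of 4: four papers have at least four citations each, but there are not five papers with five.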

In [24]:
CitationCounts = pd.read_csv(os.path.abspath(
    os.path.join('__file__', "../..",
                 "Data", "PUBMED", 'Pubmed_Cites.csv')))
AuthorMaster_withcites = pd.merge(
    AuthorMaster, CitationCounts, on=['PUBMEDID'], how='left')
allauthors_formerge = pd.DataFrame(
    AuthorMaster_withcites[['AUTHORNAME', 'citedByCount']])
allauthors_papercount = pd.DataFrame(
    allauthors_formerge['AUTHORNAME'].value_counts())
allauthors_citecount = pd.DataFrame(
    allauthors_formerge.groupby(by='AUTHORNAME')['citedByCount'].sum())
allauthors_merged = pd.merge(
    allauthors_papercount, allauthors_citecount, left_index=True,
    right_index=True)
allauthors_merged.columns = ['Papers', 'citedByCount']
allauthors_merged = allauthors_merged.sort_values(
    by='citedByCount', ascending=False)
allauthors_merged['GWAS-H'] = np.NaN
counter = 0
for author in allauthors_merged.index:
    counter += 1
    temp = AuthorMaster_withcites[AuthorMaster_withcites['AUTHORNAME'] ==
                                  author].sort_values(by='citedByCount',
                                                      ascending=False).dropna()
    temp = temp.reset_index()
    temp = temp.drop('index', axis=1)
    for pubnumber in range(0, len(temp)):
        if pubnumber + 1 > temp['citedByCount'][pubnumber]:
            # allauthors_merged['GWAS-H'][author]=pubnumber+1
            allauthors_merged.loc[author, ('GWAS-H')] = pubnumber + 1
            break
    sys.stdout.write('\r' + 'Calculating GWAS H-indices: finished ' +
                     str(counter) + ' of ' +
                     str(len(allauthors_merged.index)) + ' authors...')
allauthors_citecount.reset_index(inplace=True)
allauthors_papercount.reset_index(inplace=True)
allauthors_papercount.rename(
    columns={'AUTHORNAME': 'PAPERCOUNT', 'index': 'AUTHORNAME', },
    inplace=True)
Calculating GWAS H-indices: finished 37428 of 37428 authors...

6.2 Calculate Author Centralities

Let's next calculate some measures of authorship centrality with NetworkX. Note: this can take some time on slow computers. To make it run faster, raise the citedByCount threshold to something like 100 (restricting to authors with at least a hundred cumulative citations), or tighten the PAPERCOUNT > 1 filter.
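The co-authorship edges in the next cell are built by pairing every two authors on the same paper. A toy sketch of that construction with hypothetical papers and author names; sorting each pair canonicalises the undirected edge so (u, v) and (v, u) deduplicate in the set:

```python
import itertools

# Hypothetical PubMed IDs mapped to their author lists
papers = {
    101: ['A. Smith', 'B. Jones', 'C. Lee'],
    102: ['A. Smith', 'C. Lee'],
}

edges = set()
for authors in papers.values():
    # every pair of co-authors on a paper shares one undirected edge
    edges.update(tuple(sorted(pair))
                 for pair in itertools.combinations(authors, 2))
# 3 unique edges: the repeated Smith-Lee collaboration is deduplicated
```

`nx.Graph.add_edges_from` performs the same deduplication internally, which is why the `list(set(alledges))` call below is safe even when the same pair co-authors many papers.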

In [25]:
AuthorMaster_sumcites = pd.merge(AuthorMaster, allauthors_citecount,
                                 left_on='AUTHORNAME', right_on='AUTHORNAME',
                                 how='left')
AuthorMaster_sumcitespapercount = pd.merge(AuthorMaster_sumcites,
                                           allauthors_papercount,
                                           left_on='AUTHORNAME',
                                           right_on='AUTHORNAME', how='left')
AuthorMaster_sumcitespapercount_filter_cites = AuthorMaster_sumcitespapercount[
    AuthorMaster_sumcitespapercount['citedByCount'] > 10]
AuthorMaster_sumcitespapercount_filtered =\
    AuthorMaster_sumcitespapercount_filter_cites[
        AuthorMaster_sumcitespapercount_filter_cites['PAPERCOUNT'] > 1]

G = nx.Graph()
counter = 0
alledges = []
for paper in AuthorMaster_sumcitespapercount_filtered['PUBMEDID'].unique():
    counter += 1
    temp = AuthorMaster_sumcitespapercount_filtered[
        AuthorMaster_sumcitespapercount_filtered['PUBMEDID'] == paper]
    if len(temp) > 1:
        templist = list(itertools.combinations(temp.AUTHORNAME, 2))
        for edge in templist:
            alledges.append(edge)
G.add_edges_from(list(set(alledges)))

print('\nThis gives us a network with ' + str(len(G)) +
      ' nodes. (i.e. unique authors with more than 1 paper and 10 cumulative cites)')

betcent = pd.Series(nx.betweenness_centrality(G), name='Betweenness')
allauthors_merged = allauthors_merged.merge(
    betcent.to_frame(), left_index=True, right_index=True)

degcent = pd.Series(nx.degree_centrality(G), name='Degree')
allauthors_merged = allauthors_merged.merge(
    degcent.to_frame(), left_index=True, right_index=True)
This gives us a network with 13741 nodes. (i.e. unique authors with more than 1 paper and 10 cumulative cites)

We can then merge all of this with data collected manually: web searches for each author's country of employment and current employer, and Web of Science searches for their total H-Index. We then rank the authors in three different ways to analyze the overlap between the three metrics.
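Overlap between rankings can also be quantified directly with a rank correlation. As a stdlib sketch, Spearman's rho for the no-ties case (the Pearson correlations shown later come from `DataFrame.corr`); the scores are hypothetical:

```python
def rank(values):
    # Rank positions (1 = highest value), assuming no ties
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

def spearman(x, y):
    # Spearman's rho via the sum of squared rank differences
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank(x), rank(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

spearman([3, 1, 2], [3, 2, 1])  # → 0.5
```

Identical rankings give rho = 1 and fully reversed rankings give rho = -1, so values near 1 across the three metrics would indicate that they pick out largely the same set of prominent authors.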

In [26]:
authorsupplemental = pd.read_csv(os.path.abspath(
    os.path.join('__file__', "../..",
                 "Data", "Support", 'Author_Supplmentary.csv')),
    index_col=0)
allauthors_merged_withsupp = pd.merge(allauthors_merged,
                                      authorsupplemental,
                                      left_index=True, right_index=True,
                                      how='left',)
allauthors_merged_withsupp.sort_values(by='GWAS-H',
                                       ascending=False).head(10).to_csv(
    os.path.abspath(os.path.join('__file__',
                                 "../..",
                                 "Tables", "Authors.csv")))
allauthors_merged_withsupp.sort_values(by='GWAS-H', ascending=False).head(10)
Out[26]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution
Kari Stefansson 159 24080 80.0 0.020687 0.309098 Iceland deCode Genetics
Unnur Thorsteinsdottir 131 21051 74.0 0.006808 0.247089 Iceland deCode Genetics
Albert Hofman 255 21988 72.0 0.014551 0.342795 U.S. University of Harvard
Cornelia M van Duijn 174 17741 67.0 0.008128 0.282096 Netherlands University Medical Centre Rotterdam
Christian Gieger 155 19353 66.0 0.012230 0.275691 Germany Helmholtz Zentrum München
H-Erich Wichmann 109 18395 66.0 0.009011 0.233552 Germany Helmholtz Center Munich
Gudmar Thorleifsson 111 18072 66.0 0.006659 0.236681 Iceland deCode Genetics
Fernando Rivadeneira 184 15537 62.0 0.008661 0.276929 Netherlands University Medical Centre Rotterdam
Panos Deloukas 96 16812 62.0 0.007634 0.230422 U.K. Queen Mary University of London
Mark I McCarthy 91 18402 61.0 0.003177 0.190102 U.K. University of Oxford
In [27]:
allauthors_merged_withsupp.sort_values(by='Degree',ascending=False).head(10)
Out[27]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution
Albert Hofman 255 21988 72.0 0.014551 0.342795 U.S. University of Harvard
Kari Stefansson 159 24080 80.0 0.020687 0.309098 Iceland deCode Genetics
Cornelia M van Duijn 174 17741 67.0 0.008128 0.282096 Netherlands University Medical Centre Rotterdam
Fernando Rivadeneira 184 15537 62.0 0.008661 0.276929 Netherlands University Medical Centre Rotterdam
Christian Gieger 155 19353 66.0 0.012230 0.275691 Germany Helmholtz Zentrum München
Nicholas G Martin 150 11578 55.0 0.010098 0.269796 Australia QIMR Berghofer
Grant W Montgomery 144 11798 53.0 0.009157 0.264265 Australia University of Queensland
André G Uitterlinden 157 10809 54.0 0.007174 0.262591 Netherlands Erasmus MC
Jerome I Rotter 169 15660 56.0 0.010397 0.261863 U.S. UCLA
Unnur Thorsteinsdottir 131 21051 74.0 0.006808 0.247089 Iceland deCode Genetics
In [28]:
allauthors_merged_withsupp.sort_values(by='citedByCount',
                                       ascending=False).head(10)
Out[28]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution
Kari Stefansson 159 24080 80.0 0.020687 0.309098 Iceland deCode Genetics
Albert Hofman 255 21988 72.0 0.014551 0.342795 U.S. University of Harvard
Unnur Thorsteinsdottir 131 21051 74.0 0.006808 0.247089 Iceland deCode Genetics
Christian Gieger 155 19353 66.0 0.012230 0.275691 Germany Helmholtz Zentrum München
Mark I McCarthy 91 18402 61.0 0.003177 0.190102 U.K. University of Oxford
H-Erich Wichmann 109 18395 66.0 0.009011 0.233552 Germany Helmholtz Center Munich
Gudmar Thorleifsson 111 18072 66.0 0.006659 0.236681 Iceland deCode Genetics
Cornelia M van Duijn 174 17741 67.0 0.008128 0.282096 Netherlands University Medical Centre Rotterdam
Panos Deloukas 96 16812 62.0 0.007634 0.230422 U.K. Queen Mary University of London
Nicholas J Wareham 84 15931 55.0 0.001649 0.205386 U.K University of Cambridge

And then compute a simple correlation matrix:

In [29]:
allauthors_merged_withsupp[['citedByCount', 'GWAS-H',
                            'Betweenness', 'Degree', 'Papers']].corr()
Out[29]:
citedByCount GWAS-H Betweenness Degree Papers
citedByCount 1.000000 0.911133 0.512996 0.817566 0.836126
GWAS-H 0.911133 1.000000 0.592754 0.883849 0.943199
Betweenness 0.512996 0.592754 1.000000 0.497940 0.650988
Degree 0.817566 0.883849 0.497940 1.000000 0.850446
Papers 0.836126 0.943199 0.650988 0.850446 1.000000
In [30]:
print('There are a total of ' +
      str(len(allauthors_merged_withsupp)) + ' authors in the table.')
print('The person with the highest G-WAS H-Index is: ' +
      allauthors_merged_withsupp['GWAS-H'].idxmax())
print('The person with the highest Degree is: ' +
      allauthors_merged_withsupp['Degree'].idxmax())
print('The person with the highest citedByCount is: ' +
      allauthors_merged_withsupp['citedByCount'].idxmax())
print('The person with the most number of Papers is: ' +
      allauthors_merged_withsupp['Papers'].idxmax())
There are a total of 13741 authors in the table.
The person with the highest G-WAS H-Index is: Kari Stefansson
The person with the highest Degree is: Albert Hofman
The person with the highest citedByCount is: Kari Stefansson
The person with the most number of Papers is: Albert Hofman

6.3 Author Gender

Let's consider the gender of each author by analyzing their forenames using the gender-guesser library. We can directly compare our results to this wonderful paper. A key assumption made here is that all 'mostly male' and 'mostly female' names are lumped into their respective male/female categories.
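The collapsing of 'mostly_' labels and the male/female denominator used in the next cell can be sketched in isolation. A minimal stdlib version, with a made-up list of labels as gender-guesser would return them:

```python
def female_share(labels):
    # Collapse 'mostly_male'/'mostly_female' into 'male'/'female';
    # exclude 'andy' (androgynous) and 'unknown' from the denominator.
    collapsed = [lab.replace('mostly_', '') for lab in labels]
    resolved = [lab for lab in collapsed if lab in ('male', 'female')]
    return round(100 * resolved.count('female') / len(resolved), 2)

labels = ['male', 'mostly_female', 'female', 'andy', 'unknown', 'male']
female_share(labels)  # → 50.0
```

This is why the 'Percent of male author contributions' figures below are computed over male + female rows only, rather than over all authorship contributions.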

In [31]:
gendet = gender.Detector()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()

AuthorCounts = AuthorMaster.groupby(['PUBMEDID'])['AUTHORNAME'].count(
).to_frame().reset_index().rename(columns={"AUTHORNAME": "Author Count"})
AuthorMaster = pd.merge(AuthorMaster, AuthorCounts, how='left', on='PUBMEDID')
AuthorMaster['CleanForename'] = AuthorMaster['FORENAME'].map(
    lambda x: clean_names(x))
AuthorMaster['CleanGender'] = AuthorMaster['CleanForename'].map(
    lambda x: gendet.get_gender(x))
AuthorMaster['MaleFemale'] = AuthorMaster['CleanGender'].str.replace('mostly_',
                                                                     '')
AuthorMaster['isfemale'] = np.where(
    AuthorMaster['MaleFemale'] == 'female', 1, 0)
AuthorFirst = AuthorMaster[AuthorMaster['Author Count']
                           > 4].drop_duplicates(subset='PUBMEDID', keep='first')
AuthorLast = AuthorMaster[AuthorMaster['Author Count']
                          > 4].drop_duplicates(subset='PUBMEDID', keep='last')
AuthorUnique = AuthorMaster.drop_duplicates(subset='AUTHORNAME')

for gender_ in AuthorMaster['CleanGender'].unique():
    print(str(round((len(AuthorMaster[AuthorMaster['CleanGender'] == gender_]) /
                     len(AuthorMaster['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' authorship contributions in total')
    print(str(round((len(AuthorUnique[AuthorUnique['CleanGender'] == gender_]) /
                     len(AuthorUnique['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' authors contributions in total')
    print(str(round((len(AuthorFirst[AuthorFirst['CleanGender'] == gender_]) /
                     len(AuthorFirst['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' first authors in total')
    print(str(round((len(AuthorLast[AuthorLast['CleanGender'] == gender_]) /
                     len(AuthorFirst['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' last authors in total')

print('\nPercent of male author contributions: ' +
      str(round(len(AuthorMaster[AuthorMaster['MaleFemale'] == 'male']) /
                (len(AuthorMaster[(AuthorMaster['MaleFemale'] == 'male') |
                                  (AuthorMaster['MaleFemale'] == 'female')])) *
                100, 3)) + '%')

print('Percent of unique male authors: ' +
      str(round(len(AuthorUnique[AuthorUnique['MaleFemale'] == 'male']) /
                (len(AuthorUnique[(AuthorUnique['MaleFemale'] == 'male') |
                                  (AuthorUnique['MaleFemale'] == 'female')])) *
                100, 3)) + '%')

print('Percent of male first authors: ' +
      str(round(len(AuthorFirst[AuthorFirst['MaleFemale'] == 'male']) /
                (len(AuthorFirst[(AuthorFirst['MaleFemale'] == 'male') |
                                 (AuthorFirst['MaleFemale'] == 'female')])) *
                100, 3)) + '%')

print('Percent of male last authors: ' +
      str(round(len(AuthorLast[AuthorLast['MaleFemale'] == 'male']) /
                (len(AuthorLast[(AuthorLast['MaleFemale'] == 'male') |
                                (AuthorLast['MaleFemale'] == 'female')])) *
                100, 3)) + '%')

AuthorMaster_filtered = AuthorMaster[(AuthorMaster['MaleFemale'].str.contains(
    'male')) & (AuthorMaster['Author Count'] > 4)]
AuthorMaster_filtered_merged_bydisease = pd.merge(
    AuthorMaster_filtered, Cat_Studies[['PUBMEDID', 'DISEASE/TRAIT']],
    how='left', on='PUBMEDID')
AuthorMaster_filtered_merged_bymappedtrait = pd.merge(
    AuthorMaster_filtered, Cat_Studies[['PUBMEDID', 'MAPPED_TRAIT']],
    how='left', on='PUBMEDID')

meanfemales_bydisease = AuthorMaster_filtered_merged_bydisease.groupby(
    ['DISEASE/TRAIT'])['isfemale'].mean().to_frame()
numberofstudies_bydisease = Cat_Studies.groupby(
    ['DISEASE/TRAIT'])['DISEASE/TRAIT'].count().to_frame()
mergedgender_andcount_bydisease = pd.merge(
    numberofstudies_bydisease, meanfemales_bydisease, left_index=True,
    right_index=True)
mergedgender_andcount_bydisease = mergedgender_andcount_bydisease.sort_values(
    by='DISEASE/TRAIT', ascending=False)[0:50]
holdstring = 'Percent of authorships across 50 most commonly studied diseases \
    (from the raw Cat_Studies file merged with PUBMEDID and gender guessed): '
for index, row in mergedgender_andcount_bydisease.iterrows():
    holdstring = holdstring + \
        index.title() + ' (' + str(round(row['isfemale'], 3) * 100) + '%), '
print('\n' + holdstring[:-2])
26.05% female authorship contributions in total
25.24% female authors contributions in total
26.81% female first authors in total
19.34% female last authors in total
45.5% male authorship contributions in total
37.86% male authors contributions in total
35.39% male first authors in total
49.27% male last authors in total
4.49% andy authorship contributions in total
5.1% andy authors contributions in total
7.25% andy first authors in total
3.66% andy last authors in total
1.35% mostly_male authorship contributions in total
1.51% mostly_male authors contributions in total
1.33% mostly_male first authors in total
1.8% mostly_male last authors in total
21.2% unknown authorship contributions in total
28.78% unknown authors contributions in total
27.98% unknown first authors in total
24.14% unknown last authors in total
1.41% mostly_female authorship contributions in total
1.52% mostly_female authors contributions in total
1.24% mostly_female first authors in total
1.8% mostly_female last authors in total

Percent of male author contributions: 63.047%
Percent of unique male authors: 59.533%
Percent of male first authors: 56.699%
Percent of male last authors: 70.73%

Percent of authorships across 50 most commonly studied diseases     (from the raw Cat_Studies file merged with PUBMEDID and gender guessed): Type 2 Diabetes (32.2%), Breast Cancer (49.5%), Schizophrenia (29.599999999999998%), Body Mass Index (38.800000000000004%), Colorectal Cancer (34.1%), Prostate Cancer (40.300000000000004%), Asthma (36.7%), Alzheimer'S Disease (41.0%), Height (36.9%), Hdl Cholesterol (36.0%), Bipolar Disorder (33.6%), Parkinson'S Disease (34.4%), Ldl Cholesterol (35.5%), Systemic Lupus Erythematosus (37.1%), Rheumatoid Arthritis (33.800000000000004%), Triglycerides (36.5%), Coronary Heart Disease (28.999999999999996%), Crohn'S Disease (31.1%), Multiple Sclerosis (38.3%), Amyotrophic Lateral Sclerosis (33.0%), Alcohol Dependence (32.2%), Hypertension (30.3%), Ulcerative Colitis (31.6%), Bone Mineral Density (33.2%), Major Depressive Disorder (35.0%), Lung Cancer (36.5%), Blood Pressure (29.9%), Attention Deficit Hyperactivity Disorder (29.799999999999997%), Diastolic Blood Pressure (31.3%), Systolic Blood Pressure (31.3%), Tuberculosis (33.6%), Alzheimer'S Disease (Late Onset) (31.3%), Adiponectin Levels (35.099999999999994%), Telomere Length (38.9%), Age-Related Macular Degeneration (34.699999999999996%), Atrial Fibrillation (26.8%), Psoriasis (27.200000000000003%), Type 1 Diabetes (34.300000000000004%), Smoking Behavior (32.9%), Endometriosis (33.900000000000006%), Fasting Plasma Glucose (36.6%), Qt Interval (29.7%), Nicotine Dependence (39.6%), Glycated Hemoglobin Levels (33.300000000000004%), Glaucoma (Primary Open-Angle) (32.300000000000004%), Platelet Count (36.8%), Cholesterol, Total (36.3%), Menarche (Age At Onset) (44.800000000000004%), Pancreatic Cancer (41.4%), Urate Levels (33.800000000000004%)
In [32]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)

AuthorMaster_merged = pd.merge(AuthorMaster[(
    AuthorMaster['MaleFemale'] == 'male') |
    (AuthorMaster['MaleFemale'] == 'female')],
    EFO_Parent_Mapped, how='left', on='PUBMEDID')

AuthorMaster_EFO = AuthorMaster_merged.groupby(
    ['EFO term'])['isfemale'].mean().to_frame()
AuthorMaster_Parent = AuthorMaster_merged.groupby(
    ['Parent term'])['isfemale'].mean().to_frame()

countstudiesperEFO = EFO_Parent_Mapped.groupby(['EFO term'])['PUBMEDID'].count(
).sort_values(ascending=False)[0:17].index.tolist()

meanfemalestudyaccession = AuthorMaster_merged.groupby(
    ['STUDY ACCESSION'])['isfemale'].mean().to_frame().reset_index()

meanfemalestudyaccession_withparent = pd.merge(EFO_Parent_Mapped[[
    'STUDY ACCESSION', 'Parent term']], meanfemalestudyaccession,
    on='STUDY ACCESSION', how='left')
meanfemalestudyaccession_withparent = meanfemalestudyaccession_withparent[
    ~meanfemalestudyaccession_withparent['isfemale'].isnull()]

meanfemalestudyaccession_withEFO = pd.merge(EFO_Parent_Mapped[[
    'STUDY ACCESSION', 'EFO term']], meanfemalestudyaccession,
    on='STUDY ACCESSION', how='left')
meanfemalestudyaccession_withEFO = meanfemalestudyaccession_withEFO[
    meanfemalestudyaccession_withEFO['EFO term'].isin(
        countstudiesperEFO)]
meanfemalestudyaccession_withEFO = meanfemalestudyaccession_withEFO[
    ~meanfemalestudyaccession_withEFO['isfemale'].isnull()]

AuthorMaster_EFO.reset_index(inplace=True)
ranks = AuthorMaster_EFO[AuthorMaster_EFO['EFO term'].isin(
    countstudiesperEFO)].sort_values(by='isfemale',
                                     ascending=False)['EFO term'].tolist()
sns.set(font_scale=1.2, font="Times New Roman", style='ticks')

fig = plt.figure(figsize=(12, 6), dpi=800)
ax1 = fig.add_subplot(1, 1, 1)

p = sns.boxplot(data=meanfemalestudyaccession_withEFO,
                y='EFO term', ax=ax1, order=ranks,
                x='isfemale', whis=1.5, palette="vlag", fliersize=0)
x = sns.swarmplot(y='EFO term', x='isfemale',
                  data=meanfemalestudyaccession_withEFO, order=ranks,
                  size=2.25, color="white", edgecolor="black", linewidth=0.5,
                  ax=ax1)
p.set_ylabel('')
p.set_xlim(0, 1.05)
p.set_facecolor('white')
p.set_xlabel('Female/Male Ratio Per Paper Across Mapped Terms')
plt.setp(p.spines.values(), color='k')
plt.setp([p.get_xticklines(), p.get_yticklines()], color='k')
sns.despine(bottom=False, left=False, offset=10, ax=ax1)

plt.tight_layout()
plt.savefig(os.path.abspath(os.path.join('__file__', "../..", "Figures", "svg",
                                         'Gender_by_Subjects.svg')),
            bbox_inches='tight')

sns.set(font_scale=1, font="Times New Roman")
In [33]:
AuthorMaster_Parent.reset_index(inplace=True)
holdstring = 'Get the numbers for Parent Terms from figure above: '
for index, row in AuthorMaster_Parent.iterrows():
    holdstring = holdstring + row['Parent term'].title() + ' (' \
        + str(round(row['isfemale'], 3) * 100) + '%), '
print('\n' + holdstring[:-2])

holdstring = 'Get the numbers for Trait Terms from figure above: '
for index, row in AuthorMaster_EFO[
    AuthorMaster_EFO['EFO term'].isin(countstudiesperEFO)].sort_values(
        by='isfemale', ascending=False).iterrows():
    holdstring = holdstring + \
        row['EFO term'].title() + ' (' + str(round(row['isfemale'], 3) *
                                             100) + '%), '
print('\n' + holdstring[:-2])
Get the numbers for Parent Terms from figure above: Biological Process (40.9%), Body Meas. (41.099999999999994%), Cancer (47.199999999999996%), Cardiovascular Disease (32.6%), Cardiovascular Meas. (32.800000000000004%), Digestive System Disorder (33.7%), Hematological Meas. (34.2%), Immune System Disorder (37.2%), Inflammatory Meas. (33.800000000000004%), Lipid/Lipoprotein Meas. (36.8%), Liver Enzyme Meas. (37.2%), Metabolic Disorder (33.7%), Neurological Disorder (35.5%), Other Disease (33.800000000000004%), Other Meas. (38.4%), Other Trait (35.6%), Response To Drug (35.9%)

Get the numbers for Trait Terms from figure above: Breast Carcinoma (50.7%), Smoking Behavior (43.0%), Prostate Carcinoma (41.5%), Body Mass Index (40.8%), Multiple Sclerosis (40.2%), Unipolar Depression (37.4%), Alzheimers Disease (37.1%), Bipolar Disorder (36.9%), Triglyceride Meas. (36.8%), Colorectal Cancer (36.8%), Asthma (36.6%), Coronary Heart Disease (36.4%), Hd Lc Measurement (36.199999999999996%), Ld Lc Measurement (34.8%), Hypertension (32.6%), Type Ii Diabetes Mellitus (31.900000000000002%), Schizophrenia (31.0%)

7. Cohorts and Consortia

7.1 Consortia/Collectives

Let's use the PubMed database returns on collectives to analyze which consortia and study groups feature most prominently:

In [34]:
pubmed_collectives = pd.read_csv(os.path.abspath(
                                 os.path.join('__file__', '../..',
                                              'Data', 'PUBMED',
                                              'Pubmed_CollectiveInfo.csv')),
                                 encoding='latin-1')
collect_dict = pd.read_csv(os.path.abspath(
                           os.path.join(
                               '__file__', '../..', 'Data',
                               'Support', 'Collectives',
                               'Collective_Dictionary.csv')),
                           encoding='latin-1')
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.strip()
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.lower()
collect_dict['COLLECTIVE'] = collect_dict['COLLECTIVE'].str.strip(
).str.lower()
merged_collect = pd.merge(
    pubmed_collectives, collect_dict, how='left', on='COLLECTIVE')
if len(merged_collect[merged_collect['Clean_Name'].isnull()]) > 0:
    print('Danger! Correct collectives in Collective_Unverified.csv ' +
          'and add to the dictionary\n')
    unverified_collect = merged_collect[merged_collect['Clean_Name'].isnull(
    )]
    unverified_collect.to_csv(os.path.abspath(
                              os.path.join('__file__', '../..',
                                           'Data', 'Support', 'Collectives',
                                           'Collective_Unverified.csv')))
else:
    print('The consortium dictionary is up to date!\n')
consortiumlist = []
for index, row in merged_collect.iterrows():
    if pd.notnull(row['Clean_Name']):
        for consortium in row['Clean_Name'].split(';'):
            consortiumlist.append(consortium)
clean_collectives = pd.DataFrame(consortiumlist, columns=['Consortium'])
groupby_collectives = clean_collectives.groupby(
    ['Consortium'])['Consortium'].count().sort_values(ascending=False)
holdstring = 'The most frequently seen Consortium are: '
for index, row in groupby_collectives.to_frame()[0:50].iterrows():
    holdstring = holdstring + index + ' (' + str(row['Consortium']) + '), '
print(holdstring[:-2] + '.')

print('\nWe have seen a total of ' +
      str(len(groupby_collectives.to_frame())) + ' different collectives!')
print('A total of ' + str(len(pubmed_collectives['PUBMEDID'].drop_duplicates(
))) + ' papers are contributed to by at least one collective.')
print('A total of ' + str(groupby_collectives.sum()) +
      ' collective contributions are made.')
The consortium dictionary is up to date!

The most frequently seen Consortium are: Wellcome Trust Case Control Consortium (42), CHARGE (38), Wellcome Trust Case Control Consortium 2 (32), DIAGRAM (28), LifeLines Cohort Study (28), Alzheimer's Disease Neuroimaging Initiative (25), CARDIOGRAM (24), MAGIC (19), kConFab (16), GIANT Consortium (16), Australian Ovarian Cancer Study (13), DCCT/EDIC (13), MuTHER (13), International Stroke Genetics Consortium (12), PRACTICAL consortium (11), International MS Genetics Consortium (11), Alzheimers Disease Neuroimaging Initiative (11), Early Growth Genetics Consortium (11), CKDGen Consortium (9), arcOGEN Consortium (9), GENICA (9), Ovarian Cancer Association Consortium (9), EArly Genetics and Lifecourse Epidemiology (EAGLE) Consortium (9), Psychiatric Genomics Consortium (8), Cardiogenics (8), EPIC Study (8), AOCS Group (7), EUROSPAN Consortium (7), Bipolar Genome Study (7), Australian Cancer Study (7), Global BPgen Consortium (7), COPDGene Investigators (7), 23andMe (7), CORGI Consortium (6), Genetic Investigation of ANthropometric Traits Consortium (6), ReproGen Consortium (6), Blue Mountains Eye Study (6), EMBRACE (6), ENGAGE Consortium (6), KORA (6), GenoMEL (6), HEBON (6), Australian Asthma Genetics Consortium (6), ICBP Consortium (6), Alzheimer's Disease Genetic Consortium (6), International Consortium for Blood Pressure (5), International Parkinson Disease Genomics Consortium (5), Meta Analysis of Glucose and Insulin Related Traits Consortium (5), ECHOGen Consortium (5), Nagahama Cohort Study Group (5).

We have seen a total of 612 different collectives!
A total of 753 papers are contributed to by at least one collective.
A total of 1429 collective contributions are made.
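The dictionary-verification step above can also be expressed with pandas' merge `indicator` option, which labels rows that found no dictionary entry. A sketch with hypothetical toy rows standing in for the PubMed collectives file and the hand-curated cleaning dictionary:

```python
import pandas as pd

# Toy stand-ins (hypothetical rows) for the PubMed collectives
# and the hand-curated cleaning dictionary.
pubmed = pd.DataFrame({'COLLECTIVE': ['charge', 'diagram', 'unknown group']})
lookup = pd.DataFrame({'COLLECTIVE': ['charge', 'diagram'],
                       'Clean_Name': ['CHARGE', 'DIAGRAM']})

# indicator=True marks rows without a dictionary entry as 'left_only',
# equivalent to the isnull() check on Clean_Name above.
merged = pd.merge(pubmed, lookup, how='left', on='COLLECTIVE', indicator=True)
unverified = merged[merged['_merge'] == 'left_only']
print(unverified['COLLECTIVE'].tolist())  # ['unknown group']
```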

7.2 Datasets

Let's now wrangle together some of the manually collated data. First, let's order and consider the most frequently observed cohorts, with the aim of analyzing the resampling issue discussed earlier in the literature:

In [35]:
finaldataset = pd.read_csv(os.path.abspath(
    os.path.join('__file__', '../..', 'Data',
                 'Support', 'Cohorts',
                 'GWASCat1to1000 final.csv')),
    encoding='latin-1',
    index_col='Number')
mylist = []
for index, row in finaldataset.iterrows():
    mylist.extend(row['DATASETS'].split(';'))

mylist = [x.strip(' ') for x in mylist]
df1 = pd.DataFrame({'Cohorts': mylist})
with open(os.path.abspath(os.path.join(
        '__file__', '../..', 'Data', 'Support', 'Cohorts',
        'Dictionary_cohorts.csv')), 'r') as dict_file:
    d = {k: v for k, v in csv.reader(dict_file)}
for key in d:
    df1['Cohorts'].replace(key, d[key], inplace=True)
df1['Cohorts'] = df1['Cohorts'].str.replace(
    'Rotterdam Study I (RS-I)', 'Rotterdam Study (RS)', regex=False)
newdf = pd.DataFrame({'Count': df1.groupby(['Cohorts']).size()}).reset_index()
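One pitfall worth flagging in the harmonisation above: `Series.str.replace` treats its pattern as a regular expression by default, so the parentheses in cohort names act as grouping metacharacters and the literal name is never matched; `regex=False` requests a verbatim substring replacement. A sketch:

```python
import pandas as pd

s = pd.Series(['Rotterdam Study I (RS-I)'])

# With the default regex=True, '(RS-I)' is a capture group rather than
# literal text, so the pattern never matches the cohort name as written.
# regex=False performs a plain, verbatim substring replacement instead.
fixed = s.str.replace('Rotterdam Study I (RS-I)',
                      'Rotterdam Study (RS)', regex=False)
print(fixed.iloc[0])  # Rotterdam Study (RS)
```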

Then merge this with the manually collated data:

In [36]:
manual_cohort_for_merge = pd.read_csv(os.path.abspath(
    os.path.join('__file__', "../..", "Data",
                 "Support", "Cohorts",
                 "manual_cohort_for_merge.csv")),
    encoding='latin-1', index_col=False)

merged_manual = pd.merge(newdf, manual_cohort_for_merge,
                         how='left', on='Cohorts')
merged_manual.set_index('Cohorts').sort_values(
    by='Count', ascending=False).drop('NO NAME').head(10)
Out[36]:
Count N Ancestry Country of Recruitment Age Range Study Design
Cohorts
Rotterdam Study (RS) 373 14926.0 Netherlands Netherlands 55-106 Prospective cohort
Atherosclerosis Risk in Communities Study (ARIC) 187 15792.0 NaN US 45-64 Prospective cohort, Community
Framingham Heart Study (FHS) 183 15447.0 US US 30-62 (original) now 5-85 Prospective cohort, three generation
Cardiovascular Health Study (CHS) 163 5888.0 US US 65+ Prospective cohort
British 1958 Birth Cohort Study (B58C) 144 17634.0 England, Wales, Scotland UK 0+ Prospective birth cohort
Study of Health in Pomerania (SHIP) 120 4308.0 German Germany 20-79 Prospective cohort
UK Adult Twin Register (TwinsUK) 120 12000.0 United Kingdom, Ireland UK, Ireland 18-103 Longitudinal entry at various times
Nurses Health Study (NHS) 112 121700.0 US US 30-55 Prospective cohort
Aging Gene-Environment Susceptibility - Reykjavik Study (AGES) 110 30795.0 Iceland Iceland 33-84 Prospective cohort
European Prospective Investigation into Cancer (EPIC) 108 521330.0 10 European Countries; France, Germany, Greece... 10 EU countries 21-83 (varies by country) Prospective cohort
In [37]:
print('We guess there are a total of ' + str(len(newdf)) +
      ' different cohorts used in the 1000 biggest studies.')
We guess there are a total of 1852 different cohorts used in the 1000 biggest studies.

Let's now print out a list of the 100 most commonly used datasets in case we need to reference any particularly prominent ones (e.g. deCODE, UK Biobank, etc.):

In [38]:
merged_manual = merged_manual.set_index('Cohorts').sort_values(
    by='Count', ascending=False).drop('NO NAME')[['Count']]

holder = 'The hundred most commonly used datasets are: '
for index, row in merged_manual[0:100].iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Count']) + ' times), '

print(holder[:-2])
The hundred most commonly used datasets are: Rotterdam Study (RS) (373 times), Atherosclerosis Risk in Communities Study (ARIC) (187 times), Framingham Heart Study (FHS) (183 times), Cardiovascular Health Study (CHS) (163 times), British 1958 Birth Cohort Study (B58C) (144 times), Study of Health in Pomerania (SHIP) (120 times), UK Adult Twin Register (TwinsUK) (120 times), Nurses Health Study (NHS) (112 times), Aging Gene-Environment Susceptibility - Reykjavik Study (AGES) (110 times), European Prospective Investigation into Cancer (EPIC) (108 times), Wellcome Trust Case Control Consortium (WTCCC) (103 times), deCODE Genetics (deCODE) (101 times), Erasmus Rucphen Family Study (ERF) (94 times), Northern Finnish Birth Cohort Study (NFBC) (83 times), Cooperative Health Research in the Region of Augsburg (KORA) (81 times), Health, Aging, and Body Composition Study (Health ABC) (81 times), Multi-Ethnic Study of Atherosclerosis (MESA) (79 times), Netherlands Twin Registry (NTR) (74 times), Lothian Birth Cohort (LBC) (72 times), Orkney Complex Disease Study (ORCADES) (71 times), Estonian Genome Center of the University of Tartu (EGCUT) (70 times), Cohorte Lausannoise (CoLaus) (69 times), SardiNIA Study of Aging (SardiNIA) (66 times), The Invecchiare in Chianti (InCHIANTI) (66 times), Health Professional Follow-up Study (HPFS) (65 times), Womens Genome Health Study (WGHS) (63 times), Avon Longitudinal Study of Parents and Children (ALSPAC) (62 times), Cardiovascular Risk in Young Finns Study (YFS) (62 times), BioBank Japan Project (BBJ) (61 times), Helsinki Birth Cohort Study (HBCS) (54 times), Baltimore Longitudinal Study of Ageing (BLSA) (53 times), CROATIA-Korcula Study (CROATIA-KORCULA) (53 times), Womens Health Initiative (WHI) (53 times), Cooperative Health Research in the Region of Augsburg F3 (KORA F3) (52 times), Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) (51 times), Queensland Institute of Medical Research (QIMR) (50 times), Coronary 
Artery Risk Development in Young Adults (CARDIA) (48 times), London Life Sciences Population Study (LOLIPOP) (47 times), Cooperative Health Research in the Region of Augsburg F4 (KORA F4) (45 times), Netherlands Study of Depression and Anxiety (NESDA) (43 times), LifeLines (LifeLines) (43 times), CROATIA-Split Study (CROATIA-SPLIT) (42 times), Micro-isolates in South Tyrol Study (MICROS) (41 times), CROATIA-Vis Island Study (CROATIA-VIS) (40 times), Multiethnic Cohort Study of Diet and Health (MEC) (40 times), Family Heart Study (FamHS) (39 times), Prospective Study of Pravastatin in the Elderly at Risk (PROSPER) (37 times), Precocious Coronary Artery Disease Study (PROCARDIS) (37 times), The Population Genetic Cohort (POPGEN) (36 times), Fenland study (Fenland) (36 times), Prevention of Renal and Vascular Endstage Disease study (PREVEND) (36 times), Study of Epidemiology and Risk Factors in Cancer Heredity (SEARCH) (35 times), FINRISK (FINRISK) (35 times), Finland-United States Investigation of Non-Insulin-Dependent Diabetes Mellitus Genetics (FUSION) (35 times), Western Australian Pregnancy Cohort Study (RAINE) (34 times), Health 2000 Study (Health 2000) (34 times), Cooperative Health Research in the Region of Augsburg S4 (KORA S4) (34 times), Diabetes Genetics Initiative of Broad Institute of Harvard and MIT, Lund University, and Novartis Institutes of BioMedical Research (DGI) (33 times), German Myocardial Infarction Family Study (GerMIFS) (33 times), Jackson Heart Study (JHS) (33 times), The older order AMISH population studies (AMISH) (33 times), Austrian Stroke Prevention Study (ASPS) (33 times), Melbourne Collaborative Cohort Study (MCCS) (31 times), Nijmegen Biomedical Study (NBS) (30 times), Singapore Malay Eye Study (SIMES) (29 times), Genetic Epidemiology Network of Arteriopathy (GENOA) (29 times), Busselton Health Study (BHS) (29 times), 23andMe (23andMe) (28 times), Cancer Prevention Study (CPS) (27 times), Finnish Twin Cohort Study (FTC) (27 times), 
Singapore Prospective Study Program (SP2) (26 times), Childrens Hospital of Philadelphia (CHOP) (26 times), Blue Mountain Eye Study (BMES) (26 times), Icelandic Cancer Registry (ICR) (26 times), UK National Blood Donor Service (UKBS) (26 times), Sorbs population (SORBS) (25 times), Heinz Nixdorf Recall Study (HNR) (25 times), Northern Sweden Population Health Study (NSPHS) (25 times), Cooperative Health Research in the Region of Augsburg S3 (KORA S3) (24 times), Prospective Investigation of the Vasculature in Uppsala Seniors (PIVUS) (24 times), Epidemiological Study on the Insulin Resistance syndrome study (DESIR) (24 times), Ludwigshafen Risk and Cardiovascular Health (LURIC) (24 times), Korean Association Resource Project (KARE) (24 times), Hunter Community Study (HCS) (24 times), Religious Orders Study and Rush Memory and Aging Project (ROSMAP) (23 times), The Mayo Clinic study (MAYO) (23 times), Cancer Prostate in Sweden (CAPS) (22 times), British Genetics of Hypertension study (BRIGHT) (22 times), Genetic STudy of Aspirin Responsiveness (GeneSTAR) (22 times), Hypertension Genetic Epidemiology Network (HYPERGEN) (22 times), Singapore Chinese Eyes Study (SCES) (21 times), Health and Retirement Study (HRS) (21 times), Metabolic Syndrome In Men (METSIM) (21 times), Genetic Association Information Network (GAIN) (20 times), Healthy Aging in Neighborhoods of Diversity across the Life Span Study (HANDLS) (20 times), Uppsala Longitudinal Study of Adult Men (ULSAM) (20 times), UK Biobank (UKB) (20 times), Swedish Twin Registry (STR) (20 times), Gothenburg Osteoporosis and Obesity Determinants study (GOOD) (19 times), Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC) (19 times)

Appendix for Robustness

As a robustness check, let's compare results which drop comma-delimited traits from the raw catalog files:
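The no-comma robustness subset can be obtained with a simple boolean filter; a sketch using a hypothetical stand-in for the catalog's trait column:

```python
import pandas as pd

# Hypothetical stand-in for the raw catalog's DISEASE/TRAIT column.
studies = pd.DataFrame({'DISEASE/TRAIT': ['Height',
                                          'Asthma, hay fever',
                                          'Body mass index']})

# The robustness subset keeps only single (non comma-delimited) traits.
no_commas = studies[~studies['DISEASE/TRAIT'].str.contains(',')]
print(no_commas['DISEASE/TRAIT'].tolist())  # ['Height', 'Body mass index']
```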

In [39]:
#from Support.Robustness import traits_robustness
#from Support.Robustness import gender_robustness
#from Support.Robustness import funder_robustness
#EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(Cat_Studies, Cat_Ancestry_groupedbyN)
#Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
#FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
In [40]:
#traits_robustness(EFO_Parent_Mapped_NoCommas)
In [41]:
#gender_robustness(AuthorMaster,EFO_Parent_Mapped_NoCommas)
In [42]:
#funder_robustness(EFO_Parent_Mapped_NoCommas, FunderInfo, Cat_Ancestry)